ABSTRACT

The most demanding tenants of shared clouds require complete
isolation from their neighbors, in order to guarantee
that their application performance is not affected by other
tenants. Unfortunately, while shared clouds can offer an option
whereby tenants obtain dedicated servers, they do not
offer any network provisioning service, which would shield
these tenants from network interference.

In this paper, we introduce Links as a Service (LaaS),
a new abstraction for cloud service that provides isolation
of network links. Each tenant gets an exclusive set of links
forming a virtual fat-tree, and is guaranteed to receive the
exact same bandwidth and delay as if it were alone in the
shared cloud. Consequently, each tenant can use the forwarding
method that best fits its application. Under simple
assumptions, we derive theoretical conditions for enabling
LaaS without capacity over-provisioning in fat-trees. New
tenants are only admitted in the network when they can be
allocated hosts and links that maintain these conditions. LaaS
is implementable with common network gear, tested to scale
to large networks and provides full tenant isolation at the
worst cost of a 10% reduction in the cloud utilization.

1. INTRODUCTION 

Many owners of private data centers would like to
move to a shared multi-tenant cloud, which can offer
a reduced cost of ownership and better fault-tolerance.
For some of these tenants it is vital that their applications
are not affected by other tenants, and keep
exhibiting the same performance¹ [1-3]. For example,
a banking application may need to roll up all
account data overnight, and weather-prediction software
should similarly complete within a highly predictable
time. For such tenants, run-time predictability
is a key requirement.

Unfortunately, distributed applications often suffer
from unpredictable performance when run on a shared
cloud [4,5]. This unpredictable performance is mainly
caused by two factors: server sharing and network sharing
[6-22]. The first factor, server sharing, is easily
addressed by using bare-metal provisioning of servers,
such that each server is allocated to a single tenant [23].
However, the second factor, network sharing, is much
more difficult to address. When network links are
shared by several tenants, network contention can significantly
worsen the application performance if other
tenant applications consume more network resources,
e.g. if they simply want to benchmark their network or
run a heavy backup [24]. This can of course prove even
worse when other tenants purposely generate adversarial
traffic for DoS or side-channel attacks [25].

¹By performance, we refer to the inverse of either the total
application run-time, including both the computation and
communication times, or of the response time of online services.

As detailed in Section 2, current solutions either (a)
require tenants to provide and adhere to a specific traffic
matrix declared in advance, which often proves impractical
[11, 21]; (b) follow the hose model by providing
enough throughput for any set of admissible
traffic matrices [4,16,26], but also significantly reduce
the link bandwidth and burst size that can be allocated
to each VM; or (c) attempt to track the current
traffic matrix, but cannot guarantee constant performance
[14,15,17,19,22]. In addition, while it is known
that tailoring the packet forwarding method to the specific
tenant application can increase its performance,
none of the current cloud solutions allow multiple forwarding
algorithms to co-exist on the same network
without impacting performance.

In this paper, we introduce a simple and effective approach
that eliminates any interference in the cloud network.
This approach allows each tenant to use a network
forwarding algorithm that is optimized for its own
application. Keeping with the notion that good fences
make good neighbors, we argue that the most demanding
tenants should be provided with exclusive access to
a subset of the data center links, such that each tenant
receives its own dedicated fat-tree network. We refer
to this cloud architecture model as Links as a Service
(LaaS). The LaaS model guarantees that these tenants
can obtain the exact same bandwidth and delay as if
they were alone in the shared cloud, independently of
the number of additional tenants. We show that allocation
of links to tenants is cost-effective and implementable
by using common hardware. Note that LaaS
can similarly support a relaxed model that splits physical
links into time-domain-multiplexed channels. This
relaxed model allows multiple tenants per server, but
requires accurate packet pacing [27] not provided by
common hardware today.

Figure 1: Two tenants hosted on a cloud. (a) No LaaS, shared
links: their traffic interferes on many shared links. (b) No
LaaS, bandwidth loss: there are no shared links, but the second
tenant cannot service an admissible traffic from $S_0$ and $S_1$
to $D_0$ and $D_1$. (c) LaaS, full isolation: under LaaS conditions
of tenant placement and link allocation, the network can service
any admissible tenant traffic demands.

While the LaaS abstraction is attractive, Figure 1 illustrates
why it can be a challenge to provide it given
any arbitrary set of tenants. First, Fig. 1(a) illustrates
a bare-metal allocation of distinct hosts (servers) to two
tenants that does not satisfy the LaaS abstraction, since
the tenants share common links. Likewise, the allocation
of hosts and links in Fig. 1(b) also does not satisfy
LaaS, even though no links are shared between tenants.
This is because, regardless of the packet forwarding algorithm,
internal traffic of the second tenant from the
two hosts $S_0$ and $S_1$ in the right leaf switch to hosts
$D_0$ and $D_1$ would need to share a common link, and
so some admissible traffic patterns would not be able to
obtain full bandwidth. Interestingly, for this host placement,
we find that there is in fact no link allocation that
can provide full bandwidth to all the admissible traffic
patterns of both tenants. Finally, Fig. 1(c) fully satisfies
the LaaS conditions. All tenants obtain dedicated
hosts and links, and can service any admissible traffic
demands between their nodes, independently of the
traffic of other tenants. To generalize the above examples,
we further analyze the fundamental requirements
for providing LaaS guarantees to tenants in 2- and 3-level
homogeneous fat-trees. Under minor assumptions,
our analysis provides the necessary and sufficient conditions
to guarantee the same bandwidth and delay performance
over the shared fat-tree networks as when being
alone in the shared cloud. These conditions are
novel and greatly reduce the complexity for the online
allocation algorithm presented in Section 5.

We implement a standalone LaaS scheduler that automates
tenant placement on top of OpenStack, and
configures an InfiniBand SDN controller to provide
forwarding without interference. Our open-source code
is made available online [28]. We show that using this
code, our LaaS algorithm responds to tenant requests
within a few milliseconds, even on a cloud of 11K nodes,
i.e. several orders of magnitude faster than the time it
takes to provision a new virtual machine. In addition,
when the average tenant size is smaller than a
quarter of the cloud size, we find that our LaaS algorithm
achieves a cloud utilization of about 90%, for various
tenant-size distributions. For larger tenant sizes,
our LaaS allocation converges to the maximal utilization
obtained by a bare-metal scheduler that packs tenants
without constraints. Finally, to demonstrate the strength
of LaaS, we show performance improvements of 50%-200%
for highly-correlated tenant traffic generated by a
Bulk Synchronous Parallel (BSP) application relying on
data exchanges along a virtual three-dimensional axis
system. Thus, the performance improvement exceeds
the utilization cost for such applications, uncovering an
economic potential (Section 6).

While we focus, for brevity, on full-bisectional-bandwidth
fat-trees, we show how LaaS can be extended
to support over-provisioned (slimmed) fat-trees. We
also describe how LaaS can fit more general cloud cases,
e.g. when mixing highly-demanding tenants with regular
tenants (Section 7).

Our evaluations show that LaaS is practical and efficient,
and completely avoids inter-tenant performance
dependence.

2. RELATED WORK 

Application variability. Several studies about the
variability of cloud services and HPC application performance
were presented by [4, 5, 24, 29-31]. They
show significant variability for such applications, which
strengthens the motivation for using LaaS.

Network isolation. Specific high-dimensional tori
super-computers like IBM BlueGene, Cray XE6, and
the Fujitsu K-computer provide scheduling techniques
to isolate tenants [31-33]. However, they all rely on
forming an isolated cube on 3 out of the 5- or 6-dimensional
torus space, and thus cannot be used in
clouds with fat-tree topologies. They also exhibit a
significantly lower cluster utilization, measured as the
number of servers used over time, than the 90% utilization
obtained by LaaS on fat-trees. Another approach
reduces the interference between jobs running on the same
fat-tree by applying hard placement constraints [34].
This work reduces interference but does not guarantee the
isolation of jobs from each other.

Packet forwarding. Many architectures rely on Equal
Cost Multiple Path (ECMP) [35] to spread the allocated
tenant traffic and avoid the need to allocate exact bandwidth
on each of the used physical links [4,36,37]. However,
while ECMP load-balancing is able to balance the
average bandwidth of many small bandwidth flows, it
suffers from a heavy tail of the load distribution. When
traffic contains a relatively small number of large flows,
ECMP is known to provide poor load-balancing. Thus,
other tenants will affect the application performance.
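
The load-imbalance argument above can be illustrated with a small simulation. The sketch below is not taken from any of the cited systems: it assumes an idealized ECMP that picks one of the equal-cost uplinks uniformly at random for each flow, and the function name `ecmp_link_loads` and its parameters are hypothetical. With few large flows, some uplinks typically carry several flows while others stay idle.

```python
import random
from collections import Counter


def ecmp_link_loads(num_flows, num_links, seed=0):
    """Return the number of flows placed on each of num_links uplinks.

    Assumption: ECMP is modeled as an independent, uniformly random choice
    of uplink per flow, which idealizes a real hash on packet headers.
    """
    rng = random.Random(seed)
    counts = Counter(rng.randrange(num_links) for _ in range(num_flows))
    return [counts.get(link, 0) for link in range(num_links)]


if __name__ == "__main__":
    # Few large flows: 8 flows over 8 uplinks. A perfect spread would place
    # exactly one flow per uplink; random placement usually does not.
    loads = ecmp_link_loads(num_flows=8, num_links=8)
    print("flows per uplink:", loads, "max:", max(loads))
```

Any uplink that carries two or more full-rate flows becomes a bottleneck for those flows, which is the effect that dedicated links are meant to rule out.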

Silo [27] aims to provide guaranteed latency, bandwidth
and burst size to multiple tenants for a worst-case
traffic pattern, assuming that tenants do not optimize
their forwarding scheme. Silo achieves its guarantees by
applying accurate rate- and burst-size moderation to
enforce centrally-calculated values obtained from network
calculus. Unfortunately, Silo does not take forwarding
into account. For instance, consider a tenant
of 200 VMs placed across more than one 2-level sub-tree
(which normally can contain thousands of VMs). If 100
VMs need to send traffic to the other VMs through the
same uplink because of the forwarding rules, then each
would be restricted to use at most 1/100th of the link
bandwidth and 1/100th of the switch buffer size, which
is unacceptable for current large tenants. LaaS allows
the tenants to adapt their forwarding to the traffic pattern
without introducing inter-tenant interference, thus
allowing them to consume the full network bandwidth.

Time separation. Some systems like Cicada [10] accept
the need for handling the varying nature of tenant
traffic instead of relying only on the average demand.
They assume that traffic demands change at a pace that
is slow enough to enable them to react. Alternatively,
scheduling the MapReduce shuffle stages was proposed
by Orchestra [38]. A generalization of this approach
that allows a tenant to describe its changing communication
needs is suggested by Coflow [39]. On the same
line of thought, scheduling at a finer grain was proposed
by Hedera [18]. However, since these schemes propose a
fair-share network bandwidth to the current set of applications,
they actually change the performance of a
tenant when new tenants are introduced. Even though
fairness does improve, the tenant performance variability
grows.

Tenant resource allocation. Cloud network performance
has received significant attention over the last
few years. An overview of the different proposals to
allocate tenant network resources is provided by [6].

Virtual Network Embedding maps tenants' requested
topologies and traffic matrix over arbitrary clusters [11,
21]. However, tenants must know and declare their exact
traffic demands, which is mostly impractical. Moreover,
valid embedding is calculated by variants of linear
programming, which are known not to scale as the size
of the data centers and number of tenants grow. In
addition, as most of these solutions rely on the tenant
traffic matrix, they consider only the average demands,
falling short of representing the dynamic nature of the
application traffic. For example, they prove problematic
when an application alternates between several traffic
permutations, each utilizing the full link bandwidth.

Other proposals, such as Topology Switching and
Oktopus [4,16], propose an abstraction for the topology
and traffic demands to be allocated to the tenants.
They are similar to the hose model proposed for Virtual
Private Networks in the context of WAN [40]. In
addition, [41] attempts to provide a feedback-based fair-share
bandwidth using edge-based rate-limiting. However,
to guarantee tenant latency predictability and isolation,
such solutions would need strict time-pacing of
packets, small limits on allowed VM bandwidth and
burst-size allocation, as shown in [27]. As mentioned
above, these are impractical in current networks.

Another approach for isolation may rely on distributed
rate limiting like [22], NetShare [14], SecondNet [17],
Seawall [19], Gatekeeper [15] and Oktopus [4].
But distributed rate limiting at the network edge requires
tenant-wide coordination to avoid bottlenecks
due to load-imbalance. This coordination leads to response
times in the order of milliseconds [36], while the
lifetime of a traffic pattern for highly demanding applications
may be 2 to 3 orders of magnitude shorter.

Fairness. FairCloud provides a generalization of the
required fairness properties of the shared cloud network
[42]. LaaS tenant isolation satisfies these requirements,
and avoids the allocation complexity of the general
case.

Application-based routing. The above schemes for
network resource allocation ignore the fact that each
tenant application may perform best with a different
routing scheme. Routing algorithm types span a
wide range. Some are completely static and optimized
for MPI applications [43,44]. Others rely on traffic-spreading
techniques like ECMP [45], rely on traffic
spray as in RPS or DeTail [46,47], use adaptive routing
as proposed by DARD [48], or even rely on per-packet
synchronized schemes like Fastpass [49]. LaaS isolates
the sub-topology of each tenant, and therefore allows
each tenant to use the routing that maximizes its application
performance. Without link isolation, the different
routing engines must continuously coordinate the actual
bandwidth each one of them utilizes from each link. It
is clear that the involved complexity of such a scheme
renders it slow and impractical.

Figure 2: Experimental fat-tree cluster.

3. IMPACT OF TENANT INTERFERENCE 

This section presents the impact of concurrent tenant 
traffic on tenant performance. The presented results are 
obtained from measurements on real hardware, as well 
as simulations of InfiniBand and Ethernet networks. We 
also provide online a full description of the settings and 
of our code for the experiments [28]. 

Tenant interference in cluster experiments. The
experimental topology is a non-blocking two-level fat-tree
with 8 hosts in each of the 4 leaf switches. The
leaf switches are fully connected to 4 spine switches,
with two parallel links per connection. We assume 4
tenants, and randomly assign 8 dedicated hosts to each
of the 4 tenants. The reason for using a random placement
is that even a scheduler that follows a bin-packing
algorithm is known to show a large degree of fragmentation
in steady state [32]. The tenants independently
alternate between computation and all-to-all communication,
i.e. each node computes new results and sends
different data to the rest of the nodes that belong to the
same tenant, as a sequence of un-synchronized shift permutations.
This traffic pattern is representative of the
Shuffle stage of MapReduce, and of scientific-computing
applications such as those based on Fast Fourier Transform.
We keep the total computation time constant,
while the communication time changes with the increasing
message size (where message means a continuous
flow between a pair of machines). For a single tenant
with 32KB messages, the communication time represents
roughly 2/3 of the total time.
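
The all-to-all exchange described above can be sketched as a sequence of shift permutations. This is a minimal illustration rather than the benchmark code used in the experiments: it assumes the nodes of a tenant are ranked 0 to n-1 and that in step k node i sends to node (i + k) mod n, and the name `shift_permutations` is hypothetical. It does not model the un-synchronized progress of the nodes through the steps.

```python
def shift_permutations(num_nodes):
    """Yield the num_nodes - 1 shift permutations of an all-to-all exchange.

    In step k (1 <= k < num_nodes), node i sends to node (i + k) % num_nodes.
    Each yielded list maps a source rank (the index) to its destination rank.
    """
    for shift in range(1, num_nodes):
        yield [(src + shift) % num_nodes for src in range(num_nodes)]


if __name__ == "__main__":
    for step, dests in enumerate(shift_permutations(4), start=1):
        pairs = ", ".join(f"{src}->{dst}" for src, dst in enumerate(dests))
        print(f"step {step}: {pairs}")
```

Every step is a permutation, so each node sends one message and receives one message per step, and after n-1 steps each ordered pair of distinct nodes has exchanged exactly one message.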

Fig. 3 presents the relative application performance
in our cluster, measured for various reasonable message
sizes [50] and for 1-4 parallel tenants. The results show
that even in such a small cluster, the performance of a
tenant may degrade (i.e., its run-time may increase) by
25% for large messages when other tenants run concurrently.
Larger message sizes degrade the performance
due to the larger buffering needs and larger communication
time.

Figure 3: Relative performance, obtained by experiment
and simulation, of an application based
on all-to-all traffic, for 1-4 concurrent tenants of
8 hosts each. The maximal degradation is about
25%, even for this small cluster of 32 nodes. The
full bars on the single-tenant runs demonstrate that
we normalize each run condition separately.

Since we also want to analyze the performance of the
applications in larger clusters, we further rely on a simulator
based on an InfiniBand model [51]. As a sanity
check, we compare our small cluster measurements with
simulated results. Fig. 3 illustrates that the simulation
results for 4 tenants are about 3% worse, and
show the same trend as the experiment. The difference
probably results from a lack of accuracy in modeling
the MPI computation time, and therefore it would be
expected to decrease in larger networks with a more
significant network contention.

We also run a stencil application on the 32-node cluster.
This MPI application runs cycles of computation
and communication on a virtual x, y or z axis. We measure
the time to complete 100 compute/communicate
iterations by the first job. The jobs start one after the
other with some delay, such that the resulting measurements
show a gradual increase of the first job iteration
time due to the growing number of interfering jobs. The
results are plotted in Fig. 4, which shows a degradation
of 43% (0.215/0.15 ≈ 1.43) in the presence of 4 parallel
jobs. Note that on larger systems, where the job sizes are
larger and many more jobs exist, the expected impact
on job run-time is larger.

Figure 4: MPI stencil computation app run-time,
on a 32-node 10GB/s InfiniBand cluster,
degrades by 43% with the gradual start of
3 other similar apps.

Tenant interference in scaled-up simulations. We
now evaluate the impact of cloud size. As the number
of tenants and their sizes grow, we would expect an increased
inter-tenant friction, and therefore a degraded
application performance in the presence of concurrent
tenants. We simulate the effect of the concurrent tenant
traffic on a cloud of 1,728 hosts for 8 and 32 randomly-placed
tenants, each of 216 and 54 hosts respectively.
We measure the average relative performance of a tenant,
defined as the ratio of its performance when running
concurrently with all other tenants to its performance
when running alone. We show the impact
of inter-tenant friction on scientific-computing applications
as well as on MapReduce. For the scientific-computing
benchmark, we select stencil codes, which
are parallel programs that break the problem space
(mainly 3-dimensional) into sub-spaces, apply the same
procedure to each sub-space and exchange data mostly
with neighboring sub-spaces. This scheme is common to
many scientific programs, and especially those solving
partial differential equations, such as weather prediction
and flow dynamics. The computation time is again
kept constant while the communication time changes
with the increasing message size. For a single tenant
with 32KB messages, the communication time represents
roughly 4/9 of the total time.
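
The relative-performance metric defined above can be written out explicitly. The symbols $\rho$, $P$ and $T$ are introduced here only for illustration and are not the paper's notation. Since footnote 1 defines performance as the inverse of the run-time (or response time), the ratio of performances reduces to an inverted ratio of times:

\[
\rho \;=\; \frac{P_{\text{concurrent}}}{P_{\text{alone}}}
     \;=\; \frac{1/T_{\text{concurrent}}}{1/T_{\text{alone}}}
     \;=\; \frac{T_{\text{alone}}}{T_{\text{concurrent}}}
\]

As a worked example, and assuming the two values 0.215 and 0.15 quoted for the stencil cluster experiment are the run-times with and without interfering jobs, the run-time grows by a factor of $0.215/0.15 \approx 1.43$, which corresponds to $\rho = 0.15/0.215 \approx 0.70$.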

Fig. 5 shows how the relative performance of each tenant
decreases as the number of tenants and the message
size increase. For instance, for 32 concurrent tenants exchanging
32KB messages, the performance degrades by
45% compared to a tenant running alone (equivalently,
providing isolation from concurrent tenants would more
than double the performance). This significant loss of
performance happens despite a modest message size of
32KB, and presents a large source of potential run-time
variability. Note that the degradation of performance
is clearly a result of network contention, since
each job runs on dedicated hosts. MapReduce (simulated
under similar conditions) experiences a smaller impact
than stencil applications. Interestingly, the smaller
interference from other tenants is a result of higher
self-contention: due to the Shuffle all-to-all traffic pattern,
there is network contention even when MapReduce
runs alone. Stencil applications suffer less from
self-contention because their traffic matrix is less dense.

Our second set of simulations illustrates tenant interference
on a partition-aggregate traffic pattern, which
is characteristic of distributed database queries run by
many Web2.0 services like Facebook [47,52,53]. We
simulate such a traffic pattern on the same cluster, assuming
each of the 32 tenants splits its hosts equally
between servers and clients. The query arrivals follow
a Poisson process with a controllable rate. Each query
is sent to all servers in parallel.
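
The query workload described above can be sketched as follows. This is an illustrative generator, not the simulator used for the reported results: it relies on the standard fact that a Poisson process has exponentially distributed inter-arrival times, and the names `poisson_arrival_times` and `fan_out`, as well as the seed parameter, are hypothetical.

```python
import random


def poisson_arrival_times(rate_per_sec, duration_sec, seed=0):
    """Return sorted query arrival times (seconds) in [0, duration_sec).

    A Poisson process with rate lambda has independent, exponentially
    distributed inter-arrival gaps with mean 1 / lambda.
    """
    rng = random.Random(seed)
    times = []
    now = rng.expovariate(rate_per_sec)
    while now < duration_sec:
        times.append(now)
        now += rng.expovariate(rate_per_sec)
    return times


def fan_out(arrival_times, servers):
    """Pair each query with every server, modeling the parallel send."""
    return [(t, server) for t in arrival_times for server in servers]


if __name__ == "__main__":
    arrivals = poisson_arrival_times(rate_per_sec=1000.0, duration_sec=0.01)
    requests = fan_out(arrivals, servers=["s0", "s1", "s2"])
    print(len(arrivals), "queries,", len(requests), "server requests")
```

Raising `rate_per_sec` plays the role of the controllable query rate: each query fans out to all servers at once, so the offered network load grows with both the rate and the number of servers.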
Figure 5: Simulated relative performance for
8 and 32 tenants on a cloud of 1,728 hosts,
with Stencil scientific-computing applications or
MapReduce-based applications. The performance
relative to a single tenant degrades as the
traffic volume and the number of concurrent tenants
increase.

Figure 6: Simulated distributed database tenants
placed randomly on a 1,728-node cluster.
The percentage of queries not meeting a 10-msec
deadline vs. the offered query rate shows steep
saturation.

Fig. 6 shows the percentage of late queries not meeting
a 10-msec deadline. The steep increase of late
queries happens at about 10,450 queries per second for
the 32 concurrent tenants, versus 13,600 queries per second
for a single tenant. The network link sharing resulted
in a degradation of about 30% in the effective
query rate.

We further want to confirm that similar results are
obtained for a lossy Ethernet network. We simulate a
32-node Ethernet cluster employing ECMP routing and
DCTCP [52], using an INET [54] simulator enhanced
with a specially-implemented DCTCP plugin. We simulate
32 nodes and not 1,728 nodes because this simulator
is less scalable. There are only two tenants: The first
is a regular 8-node tenant implementing MapReduce,
of random Map and Reduce times and variable Shuffle
data size (producing a similar ratio of communication
time to total time). The second is an 8-node adversarial
aggressor tenant. Each adversarial node continuously
generates 1MB messages, sent in parallel to all its other
nodes. We intentionally keep half the nodes unused to
illustrate the detrimental impact of other tenants even
in an over-provisioned cluster. Fig. 7 presents the relative
performance of MapReduce in the presence of the
adversarial tenant as compared to its performance when
running alone. The worst relative performance is obtained
for messages of 128KB, with a degradation of
25% even in such a small and over-provisioned cluster.
We suspect that the increase in the last value with
256KB messages results from an artifact of DCTCP.

Figure 7: Simulated relative performance of an
8-node MapReduce tenant on a 32-node Ethernet
cluster running DCTCP. An adversarial 8-node
tenant degrades the performance by 25%.

4. LAAS ARCHITECTURE 

A typical cloud architecture depicted in Fig. 8 consists
of (a) a front-end interface for tenants to register
their requests, (b) a scheduler that decides when and
how to service these requests and can allocate hosts to
tenants (e.g., an OpenStack Nova scheduler and a Heat
application setup), and (c) a network controller that
performs the network setup (e.g., an OpenStack Neutron
and an SDN back-end). In this section, we introduce
a LaaS cloud architecture that enhances this architecture
by enabling the allocation of tenant-exclusive
hosts and links.

Specifically, we propose to extend the scheduler with
link allocation functionality (on top of the host allocation),
and enhance the network controller by adding
network routing rules to enforce the link allocation.
Fig. 8 emphasizes these two extensions by bold lines on
an abstract cloud management software architecture.

Figure 8: Cloud management system architecture,
with LaaS extensions in bold.

Scheduler. We require the scheduler to provide each
new tenant with an exclusive set of dedicated hosts and
dedicated links. As in bare-metal allocation, a tenant
may request a given number of dedicated hosts², which
may be further refined by requirements of memory, accelerators
or number of cores. In our implementation,
we assume homogeneous hosts. In addition, the scheduler
provides each new tenant with a set of dedicated
links that form a tenant sub-topology, which will guarantee
full bandwidth for any admissible traffic matrix
of the tenant, i.e. will provide the tenant with the same
bandwidth as in its own private data center.

In the LaaS architecture, we assume that the scheduler
employs an online algorithm, by successively processing
one new tenant request at a time. Each new
tenant may be either accepted to the cloud, or denied
due to the unavailability of a sub-network that
can provide enough dedicated hosts and links. In any
case, the scheduler does not migrate already-running
tenants. This could be relaxed if we want to allow
global optimization of host placements, by running tenants
over virtual machines (VMs) and allowing migrations
[55-57]. But then, tenant run-times would be impacted
by the arrival of new tenants, which is precisely
what we want to avoid.

Network controller. As depicted in Fig. 8, we require
the information of the allocated links to be provided
by the scheduler to the network devices. This
information should be used to adjust the network forwarding
and routing to provide tenant isolation. This
task fits SDN networks, but may also be implemented in
other network architectures like TRILL [58]. There are
several different ways to implement such an isolation-aware
network controller. At one extreme, which requires
switch-virtualization hardware support, a master
controller may configure the underlying switches to be
split into multiple virtual switches [20]. Then each tenant
may incorporate its own SDN controller, which can
then only discover its own isolated sub-topology. Another
approach is to let a single SDN controller do all
the work and enhance all the routing engines to work
on sub-topologies. We rely in our implementation on an
off-the-shelf InfiniBand SDN controller with a capability
of defining sub-topologies and routing packets in an
isolated manner (L2 forwarding). This feature, known
as Routing Chains, is described in [59]. This isolated-routing
feature could also be implemented by Ethernet
SDN controllers like OpenDaylight.

5. LAAS ALGORITHM 

In this section, we describe online algorithms for tenant
placement and link allocation in the LaaS scheduler.
Online placement algorithms require the existing
tenant placement to be maintained when a new job
is placed, and therefore do not move existing tenants.
Similarly, we provide online link-allocation algorithms
to avoid any traffic interruption when a new tenant is
introduced. The algorithm we describe provably guarantees
that a tenant will obtain a dedicated set of hosts
and links, with the same bandwidth as in its own private
data center. The algorithm relies on the required
properties of the placement to trim the solution space
and achieve fast results.

We first study 2-level fat-trees, and then generalize
the results to 3 levels. We first present a Simple heuristic
algorithm, and then extend it with a LaaS algorithm
that achieves a better cloud utilization.

5.1 Isolation for 2-level Fat Trees 

Consider a 2-level full-bisectional-bandwidth fat-tree
topology, i.e. a Full Bipartite Graph between leaf
switches and spine switches, as in Fig. 1 above. For
brevity, we denote by $FBG_i$ the Full Bipartite Graph that
makes the fat-tree connections between switches at levels
$lvl_i$ and $lvl_{i+1}$. The topology is composed of $r$ leaf
switches, denoted $L_i$ for each $i \in [1, r]$, and $m$ spine
switches. Each leaf switch is connected to $n \le m$ hosts, as
required to meet the rearrangeably non-blocking condition
for fat-trees [60].

Problem definition. Given a pre-allocation of tenants
(with pre-assigned links and hosts), when a new tenant
arrives with a request for $N$ hosts, we need to find:

(i) Host placement: Find which free hosts to allocate to
the new tenant, i.e. allocate $N_i$ free hosts in each leaf $i$
such that $N = \sum_{i=1}^{r} N_i$.

(ii) Link allocation: Find how to support the tenant
traffic, i.e. allocate a set $S_i$ of spines for each leaf $i$, such
that the hosts of the new tenant in leaf $i$ can exclusively
use the links to $S_i$, and the resulting allocation can fully
service any admissible traffic matrix.

We want to fit as many arriving tenants as possible 
into the cloud such that their host placement and link 
allocation obey the above requirements, and without 
changing pre-existing tenant allocations. 

Simple heuristic algorithm. We first introduce a
Simple heuristic algorithm, as a basis for the discussion
of our algorithm. It relies on a property of fat-trees
and minimum-hop routing: if a single tenant is placed
within a sub-tree, then traffic from other tenants will
not be routed through that sub-tree. Note that for 2-level
fat-trees a sub-tree is a leaf switch.

Let $N$ denote the number of tenant hosts, and $n$ the
number of hosts per leaf. The Simple heuristic simply
computes the minimal number $s$ of leaf switches
required for the tenant: $s = \lceil N/n \rceil$. Then, it finds $s$
empty leaf switches to place the tenant hosts in. Finally,
if $s > 1$, it allocates all the up-links leaving the $s$
leaf switches; else, no such links are needed.

Figure 9: Two tenants of sizes 6 and 7 hosts
placed by the Simple heuristic, where each tenant
fills a number of complete sub-trees.

Fig. 9 illustrates the Simple algorithm, showing how
tenant $T_1$ obtains a placement for $N = 6$ hosts. First,
$s = \lceil 6/4 \rceil = 2$. Assuming $T_1$ arrives first, the two left
leaves are available when it arrives, and they are used to
host $T_1$. Also, all the up-links of these 2 leaf switches are
allocated to $T_1$. When it arrives, tenant $T_2$ is similarly
allocated the two right leaves and their up-links.
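
As an illustration only, the Simple placement rule for a 2-level fat-tree can be sketched in a few lines of Python. The function name `simple_place`, the leaf numbering 0..3 and the list-based bookkeeping are assumptions made for this sketch and are not part of the LaaS implementation.

```python
import math

def simple_place(num_hosts, hosts_per_leaf, empty_leaves):
    """Place a tenant on whole empty leaves, as in the Simple heuristic.

    Returns (chosen_leaves, allocate_uplinks), or None when fewer than
    s empty leaves remain. All names here are hypothetical.
    """
    s = math.ceil(num_hosts / hosts_per_leaf)   # s = ceil(N / n)
    if s > len(empty_leaves):
        return None
    chosen = list(empty_leaves[:s])
    # Up-links are only needed when the tenant spans more than one leaf.
    return chosen, s > 1

# Setting of Fig. 9: n = 4 hosts per leaf, four empty leaves numbered 0..3.
empty = [0, 1, 2, 3]
t1 = simple_place(6, 4, empty)                  # tenant of 6 hosts
empty = [leaf for leaf in empty if leaf not in t1[0]]
t2 = simple_place(7, 4, empty)                  # tenant of 7 hosts
```

Under these assumptions both tenants receive two complete leaves together with all of their up-links, which mirrors the placement shown in Fig. 9.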

In the general case, any placement obtained by Simple
supports any admissible traffic pattern. This is because
the dedicated sub-network of the tenant is a single leaf
switch if $s = 1$, and a 2-level fat-tree if $s > 1$, which is a
folded-Clos network with $m \ge n$. It is well known that
such a topology supports any admissible traffic pattern,
because it meets the rearrangeable non-blocking criterion
and the Birkhoff-von Neumann doubly-stochastic
matrix-decomposition theorem [60].

Extended Simple Heuristic. It is possible to allow 
a single tenant with hosts within a sub-tree to span 
across multiple sub-trees. The same argument used for 
the simple case holds for the extended case since only 
the traffic of the single tenant, leaving the sub-tree, is 
crossing the top level of the sub-tree. Thus isolation is 
maintained. Since the entire set of links at the top layer 
must match the number of hosts within that sub-tree 
the obtained topology supports any admissible traffic 
matrix. 

For example, Fig. 10 shows how tenant $T_3$ occupies
a part of a leaf sub-tree which is shared by tenant $T_1$
extending out of that sub-tree. No traffic other than that of
$T_1$ would need to leave the same sub-tree and thus all
the top links in that tree are allocated to tenant $T_1$.

LaaS placement analysis. This section describes a
necessary condition on placement and a sufficient condition
on link allocation that are key to making the LaaS
algorithm correct and efficient. The placement condition
requires the allocation of $N$ tenant hosts as $Q$ leaves
of $D$ hosts and optionally an additional leaf of $R$ hosts,
with $R < D$, such that $N = Q \cdot D + R$. The sufficient link
allocation condition requires the links of $R$ spines connecting
to the $Q$ leaves and the optional single leaf of $R$ hosts.
A further $D - R$ spines should connect just to the $Q$ leaves.

Figure 10: Three tenants placed by the Extended
Simple heuristic. Note that $T_1$ and $T_3$ are sharing
the same sub-tree (leaf) but only one of
them, $T_1$, is allowed to expand out of the shared
sub-tree.

Consider a single leaf $i$ with $N_i$ tenant hosts. In the
analysis below, we make the following simplifying
assumption: on every leaf switch, the number of leaf-to-spine
links (and the corresponding number of spines)
allocated to a tenant equals the number of its allocated
hosts:

$|S_i| = N_i$. (1)

Our simplifying assumption is based on the following
intuition. On the one hand, for tenants occupying several
leaves, if $|S_i| < N_i$, we may not be able to service
all admissible traffic demands (since we may have
up to $N_i$ flows that need to exit leaf $i$, but only $|S_i|$
links to service them). On the other hand, allocating
$|S_i| > N_i$ is wasteful, because the number of remaining
spine switches would then be less than the number of
available hosts, and therefore future tenants spanning
more than one leaf may not be able to obtain enough
links to connect their hosts.

Without loss of generality, we also make a notational
assumption that the $N_i$'s are sorted such that
$0 < N_1 \le N_2 \le \dots \le N_t$, where $t$ is the number of
leaves connected to hosts allocated to the tenant.

We will now see that our assumptions lead (by a se¬ 
quence of lemmas) to a simple rule that greatly simpli¬ 
fies the possible placements that need to be evaluated 
by our LaaS scheduling algorithm. 

Lemma 1. The number of common spines that connect
two leaves must at least equal their minimal number
of allocated hosts:

$\forall i < j \in [1,t]: N_i = \min(N_i, N_j) \le |S_i \cap S_j|$. (2)

Proof. Consider a traffic permutation among the
tenant hosts. There are up to $N_i$ full-link-capacity
host-to-host flows going from $L_i$ to $L_j$ (or back). Since each
flow has to use a different link and each link goes to a
different spine switch, we will need at least $N_i$ common
spine switches in $S_i \cap S_j$. □


Lemma 2. The number of common spines that connect
two leaves to a third must at least equal the minimal
number of allocated hosts, either in the union of
the first two leaves or in the third, i.e.
$\forall i,j,k \in [1,t]: \min(N_i + N_j, N_k) \le |S_i \cup S_j|$.

Proof. Let $c = \min(N_i + N_j, N_k)$. There are at
most $c$ flows going from $L_k$ to either $L_i$ or $L_j$ (or back).
Since each flow has to use a different link and each link
goes to a different spine switch, we will need at least $c$
spines in the union $S_i \cup S_j$ of the spines connected to
the two leaves. □

Lemma 3. The number of allocated hosts in any leaf
cannot exceed the number in the union of any two other
leaves, i.e. $\forall i \ne j \ne k \in [1,t]: N_i, N_j, N_k > 0 \Rightarrow
N_i + N_j \ge N_k$.

Proof. Assume the contrary: $N_i + N_j < N_k$. There
are only two cases: $N_i \le N_j < N_k$ or $N_j \le N_i < N_k$.
W.l.o.g., we assume the first. If so, $\min(N_i + N_j, N_k) =
N_i + N_j$. By Lemma 1, to enable connectivity between
leaves $i$ and $j$, they must have at least $N_i$ spines
in common: $|S_i \cap S_j| \ge N_i$. Substituting the above
into Lemma 2 we obtain: $\min(N_i + N_j, N_k) = N_i + N_j
\le |S_i \cup S_j| = |S_i| + |S_j| - |S_i \cap S_j|$.
But since $N_i = |S_i|$ and $N_j = |S_j|$ in LaaS by Equation (1),
we get $0 \le -|S_i \cap S_j|$. But $S_i \cap S_j$ is non-empty
because otherwise traffic from hosts in leaf $i$ to
hosts in $j$ wouldn't be able to pass. So we get a contradiction,
thus $N_i + N_j \ge N_k$. □
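
To make the conditions of Lemmas 1 and 2 concrete, the following sketch checks them on a candidate link allocation. It is only a hedged illustration: the function name `check_lemmas` and the representation of an allocation as a dictionary from a leaf index to its set of spine indices are assumptions of this sketch, and Equation (1) is taken for granted so that $N_i = |S_i|$. Passing this check is necessary, not sufficient, for LaaS.

```python
from itertools import combinations, permutations

def check_lemmas(spines):
    """spines: dict leaf -> set of allocated spine ids, with |S_i| = N_i.

    Returns True when the necessary conditions of Lemma 1 and Lemma 2
    hold for every pair and every triple of distinct leaves.
    """
    n = {leaf: len(s) for leaf, s in spines.items()}
    # Lemma 1: min(N_i, N_j) <= |S_i intersect S_j|
    for i, j in combinations(spines, 2):
        if len(spines[i] & spines[j]) < min(n[i], n[j]):
            return False
    # Lemma 2: min(N_i + N_j, N_k) <= |S_i union S_j|
    for i, j, k in permutations(spines, 3):
        if len(spines[i] | spines[j]) < min(n[i] + n[j], n[k]):
            return False
    return True

# Two leaves of 3 hosts and one of 2 hosts sharing spines: conditions hold.
ok = check_lemmas({0: {0, 1, 2}, 1: {0, 1, 2}, 2: {0, 1}})
# Leaves of 1, 1 and 3 hosts: Lemma 2 fails, since one spine cannot
# carry the two flows leaving the large leaf towards the small ones.
bad = check_lemmas({0: {0}, 1: {0}, 2: {0, 1, 2}})
```

The second example has $N_1 + N_2 < N_3$, so it is also excluded by Lemma 3.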

Necessary host placement. We will now provide two
theorems showing necessary and sufficient conditions to
get the LaaS conditions of tenant traffic isolation and
support for any admissible traffic matrix. Interestingly,
the first theorem gives necessary conditions on the
host placement, while the second theorem provides
sufficient conditions on the link allocation. We continue to
assume throughout the rest of the paper that $|S_i| = N_i$
for all $i$, and $N_1 \le N_2 \le \dots \le N_t$.

Theorem 1. A necessary condition for LaaS is

$N_1 \le N_2 = N_3 = \dots = N_t$, (3)

implying that all leaf switches of a tenant should hold
the exact same number of hosts except for a potential
smaller one.

Proof. We show that $N_2 = N_t$. By Lemma 1, $L_1$
and $L_2$ must have at least $N_1 = |S_1|$ spines in common,
i.e. $S_1 \subseteq (S_1 \cap S_2)$. Therefore, $S_1$ is a subset of $S_2$, so
$|S_1 \cup S_2| = |S_2| = N_2$. By Lemma 3, when $i = 1$, $j = 2$
and $k = t$, $N_1 + N_2 \ge N_t$, thus $\min(N_1 + N_2, N_t) = N_t$.
So, when $N_t$ flows are sent from $L_t$ to $L_1$ and $L_2$, we
must have at least $N_t$ common spines: $|S_1 \cup S_2| = N_2 \ge
N_t$. But since $N_2 \le N_t$, it follows that $N_2 = N_t$. □

Figure 11: A tenant of $N = Q \cdot D + R$ hosts.
To implement LaaS, there must be $Q$ leaves of
$D$ hosts and optionally one leaf of $R < D$ hosts.

Given Theorem 1, the tenant placement should follow
the form: $N = Q \cdot D + R$, where $Q$ is the number of
repeated leaves with $D$ hosts each, and we optionally add
one unique leaf with a smaller number of hosts $R$. This
notation follows the Divisor, Quotient and Remainder
of $N$. This result is useful because it greatly simplifies
the solution of the host placement problem defined
above.
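
A brief sketch of how the candidate placements implied by Theorem 1 could be enumerated is given here. The generator name `decompositions` and the decision to start from the largest feasible $D$ are assumptions of this sketch; it only lists the arithmetic forms $N = Q \cdot D + R$ with $R < D$ and does not check whether enough leaves or links are free.

```python
def decompositions(n_hosts, leaf_capacity):
    """Yield (Q, D, R) with n_hosts = Q*D + R and 0 <= R < D <= leaf_capacity.

    Candidates are produced from the largest D downwards, so that the
    most compact placements are considered first (an assumed ordering).
    """
    for d in range(min(n_hosts, leaf_capacity), 0, -1):
        q, r = divmod(n_hosts, d)   # quotient Q and remainder R for divisor D
        yield q, d, r

# A tenant of 10 hosts on leaves that hold at most 4 hosts each.
candidates = list(decompositions(10, 4))
```

For 10 hosts and leaves of 4 hosts, the first candidate is 2 leaves of 4 hosts plus one leaf of 2 hosts, followed by 3 leaves of 3 hosts plus one leaf of 1 host, and so on.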

Fig. 11 demonstrates this result. It shows $Q$ leaf
switches of $D$ hosts each, and optionally another leaf
switch of $R < D$ hosts. We denote by $S^Q$ the set of
spines connected by allocated links to the $Q$ leaves of $D$
hosts, and by $S^R$ those that connect via allocated links
to the optional leaf of $R$ hosts.

Sufficient link allocation. We can now prove suffi¬ 
cient conditions on the link allocation to satisfy LaaS. 

Theorem 2. A sufficient condition for LaaS is that
the link allocation satisfies $\forall i \in [1,Q]: S_i = S^Q$ and if
$R > 0$: $S^R \subset S^Q$; i.e. all the allocated leaf up-links of a
given tenant go to the exact same set of spine switches
(or a subset of it for the remainder leaf).

Proof. For the case $R = 0$, the link allocation above
means there is a group of $D$ spine switches that connect
to all leaf switches. Thus the tenant sub-topology
reduces to a Full Bipartite Graph (FBG) with $m' = D$
spine switches and $n' = D$ hosts per leaf. Since $m' = n'$,
such a topology is a rearrangeable non-blocking folded-Clos,
which is known to support any admissible traffic matrix
as mentioned above.

For the case of one additional leaf $L_R$ of $R$ hosts, we
provide a constructive method for routing arbitrary
permutations. We consider the FBG sub-topology formed
by the tenant hosts and links, where $L_R$ connects to
all $S^Q$ spines. For this topology $m' = n' = D$ and
$r' = Q + 1$. Again, $m' = n'$ so it is guaranteed by
the rearrangeable non-blocking theorem that every full
permutation of $n' \cdot r'$ flows is route-able. Routing is
symmetric with respect to the spine switches. Moreover, to
avoid congestion, each spine needs to carry exactly 1
flow from each leaf and 1 flow to each leaf. So any full
permutation of our original topology where $L_R$ has only
$R$ flows will be $D - R$ flows short. We extend these flows

Figure 12: Illustration that a simple host placement
is not sufficient, and a joint host placement
and link allocation is necessary for LaaS.
(a) All tenants satisfy the host placement necessary
conditions, e.g. the placement of C is
$3 = Q \cdot D + R = 2 \cdot 1 + 1$. A and B support any
admissible traffic matrix by the sufficient link
allocation conditions. (b) However, the link allocation
for C is impossible. There is no way to
find a common set of spines with free ports.

with $D - R$ flows going from $L_R$ to $L_R$. Since these
flows share the same leaf switch they must be routed
through $D - R$ different spines. After completing the
full permutation routing, and since all spines connect
to all leaves, we swap between each spine that carries
one of the added $D - R$ flows with a spine that is not
included in $S^R$. As the links allocated to the extra flows
are not needed, any permutation is fully routed by the
original topology. □
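
The sufficient condition of Theorem 2 is easy to test on a given allocation. The sketch below is only an illustration under assumed names: `satisfies_theorem2` and the set-based representation are not taken from the LaaS implementation. It accepts a subset for the remainder leaf; under Equation (1) and $R < D$ this subset is automatically a proper one.

```python
def satisfies_theorem2(repeated, remainder=None):
    """Check the link-allocation condition of Theorem 2.

    repeated:  list of spine sets, one per leaf of D hosts (the Q leaves).
    remainder: spine set of the optional leaf of R hosts, or None if R = 0.
    """
    if not repeated:
        return False
    common = repeated[0]
    # All Q repeated leaves must use exactly the same spine set S^Q.
    if any(s != common for s in repeated[1:]):
        return False
    # The remainder leaf may only use spines already in S^Q.
    return remainder is None or remainder <= common

same_spines = satisfies_theorem2([{0, 1, 2}, {0, 1, 2}], {0, 1})
mixed_spines = satisfies_theorem2([{0, 1, 2}, {0, 1, 3}])
stray_remainder = satisfies_theorem2([{0, 1, 2}], {0, 3})
```

Only the first allocation satisfies the condition; the second uses different spine sets on the repeated leaves, and the third lets the remainder leaf use a spine outside $S^Q$.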

A necessary host allocation is not sufficient. The
above theorems provide us with guidelines for implementing
LaaS. We now show that due to previous tenant
allocations, a host placement as in Theorem 1 is
not always sufficient to provide a needed link allocation
as in Theorem 2. This is why Theorem 2 proves essential.
If the link allocation cannot be found for a specific
placement, our algorithm will need to search for another
host allocation.

Lemma 4. A host placement that meets Theorem 1 
does not guarantee the existence of a link allocation 
that meets Theorem 2, and therefore does not guarantee 
LaaS. 

Proof. We prove Lemma 4 by the example provided
in Fig. 12. Three tenants are shown placed according
to the provided heuristic of the previous section: A has
$8 = 2 \cdot 3 + 2$ hosts, B has $5 = 2 \cdot 2 + 1$, and C has
$3 = 1 \cdot 2 + 1$. We track allocated up-links of the leaf switches
in a matrix where rows represent the leaf switches and
columns represent the spines each port connects to. As
can be observed, there is no possible link allocation for
tenant C, since the leaves it is placed on do not have
free links connected to any common spine. There is
no link allocation possible for C even though it was
placed according to the conditions of Theorem 1. The
online link allocation algorithm for C (after A and B
are placed) cannot allocate the links. In fact, even an
offline version of link allocation, reassigning the links of
A and B, cannot solve the problem as long as the placement
of A and B does not change. □

According to Lemma 4, some tenant requests may be 
denied because the scheduler cannot find a proper link 
allocation. Thus any LaaS algorithm has to validate 
the feasibility of a link allocation for each legal host 
placement. 

5.2 Isolation for 3-level Fat Trees 

So far we have discussed the LaaS allocation for
2-level fat-trees. We now extend the results to 3-level
fat-trees, which form the most common cloud topology
[61,62]. We use the notation of Extended Generalized
Fat Trees (XGFT) [63], which defines fat-trees
of $h$ levels by the number of sub-trees at each level,
$m_1, m_2, \dots, m_h$, and the number of parent switches at
each level, $w_1, w_2, \dots, w_h$.

We consider three approaches to this problem: a Simple
heuristic, a Hierarchical decomposition, and an
Approximated scheme. We conclude with a description of
the final LaaS algorithm that we implemented, relying
on the Approximated scheme.

Simple heuristic for 3-level fat-trees. The Simple
algorithm described in sub-section 'Simple heuristic
algorithm' is easily extended to any fat-tree size. For
an arbitrary XGFT, first define the number of hosts $R_l$
under a sub-tree of level $l$: $R_0 = 0$, and $R_l = \prod_{i=1}^{l} m_i$.

Given a tenant request for $N$ hosts, Simple first
determines the minimum level $l_{min}$ of the tree that can
contain all $N$ tenant hosts:

$l_{min} = \min\{l \mid (R_{l-1} < N) \wedge (R_l \ge N)\}$ (4)

and the number $s$ of required sub-trees of level $l_{min}$:
$s = \lceil N / R_{l_{min}-1} \rceil$. Then, it places the tenant hosts in
$s$ free sub-trees of level $l_{min}$. It also allocates to the
tenant all the links internal to these $s$ sub-trees; and if
$s > 1$, it allocates as well all the links connecting the
sub-trees to the upper level.
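
The level selection of Equation (4) can be sketched as follows. This is a hedged illustration: it assumes that the number of hosts under a level-$l$ sub-tree is the product $m_1 \cdots m_l$, it treats a tenant that fits in one leaf as needing $s = 1$ (the formula for $s$ would otherwise divide by $R_0 = 0$), and the function name `simple_level` and the example tree with 18, 18 and 36 children per level are assumptions of this sketch.

```python
import math

def simple_level(n_hosts, m):
    """Return (l_min, s) for a tenant of n_hosts on an XGFT with
    children counts m = [m_1, ..., m_h], or None if it cannot fit."""
    sizes = [0]                     # R_0 = 0
    prod = 1
    for m_i in m:
        prod *= m_i
        sizes.append(prod)          # assumed R_l = m_1 * ... * m_l
    for l in range(1, len(m) + 1):
        if sizes[l - 1] < n_hosts <= sizes[l]:
            # Assumption: a tenant that fits in one leaf uses s = 1.
            s = 1 if l == 1 else math.ceil(n_hosts / sizes[l - 1])
            return l, s
    return None

tree = [18, 18, 36]                 # hypothetical 3-level fat-tree
small = simple_level(6, tree)       # fits in one leaf
medium = simple_level(19, tree)     # needs two leaves
large = simple_level(325, tree)     # needs two 2-level sub-trees
```

A tenant of 19 hosts thus consumes two whole leaves (36 hosts), which illustrates the rounding cost discussed next.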

It is clear that the Simple heuristic algorithm, by 
rounding up the number of nodes, trades off cluster uti¬ 
lization for simplicity, non-fragmentation, and greater 
locality with lower hop distances. As we show in the 
evaluation section, the utilization obtained by this al¬ 
gorithm is low, making it potentially unacceptable to 
cloud vendors, so we keep looking for a better one. 
Hierarchical decomposition. In this section we de¬ 
scribe how LaaS can be provided to a 3-level fat-tree 
using a hierarchical decomposition approach following 
the recursive description of fat-trees in [64]. 

Fig. 13 shows an example of a 3-level fat-tree. We
denote the switches on the tree by their levels (from
bottom up) $lvl_1$, $lvl_2$ and $lvl_3$. We show that for a LaaS
link allocation to be feasible, the condition of Theorem 1
needs to hold not only for each 2-level sub-tree but also
for each $lvl_2$-$lvl_3$ Full Bipartite Graph ($FBG_2$) at
the top of the tree. One of these FBGs is highlighted
in Fig. 13.

As we showed in the previous sections, since the tenant
traffic pattern may be completely contained within
each 2-level tree, host allocation in each 2-level tree
must adhere to Theorem 1. So the number of tenant
hosts within the 2-level sub-tree $j$ must be of the form
$N_j = Q_j \cdot D_j + R_j$. Note that an allocation that fits in a
single leaf switch also follows this scheme with $Q_j = 1$.

Fig. 13 depicts a Theorem 1-compliant host allocation
within each of the 2-level sub-trees. It follows the form:
$N_j = Q_j \cdot D_j + R_j$, $j \in \{1, \dots, m_3\}$. Note that the link
assignment within the 2-level sub-trees must also adhere
to Theorem 2 such that $S_j^R \subset S_j^Q$. Consequently, the
maximum number $U_j$ of flows leaving the 2-level sub-tree
from switch $s$ can be either 0 in case $s \notin S_j^Q$, $Q_j$
in case $s \in S_j^Q \setminus S_j^R$, or $Q_j + 1$ if $s \in S_j^R$.

When we consider the conditions required for the
highlighted $FBG_2$ to support any admissible traffic pattern,
it is strikingly similar to the analysis we provided
for the 2-level fat-tree. For the 2-level tree we already
proved that in order to support any admissible traffic
pattern, the sequence of $U_j$ values must meet the rule
$U_1 \le U_2 = \dots = U_{m_3}$. Applying the same to
the 3-level tree we obtain a requirement for the assignments
of $U_j$ on each of the $FBG_2$. However, each one
of the $FBG_1$ (there are $m_3$ such 2-level sub-trees) could
select a different set of $S_j^Q$ and $S_j^R$. This means that
a solution could allow each 2-level sub-tree to select a
different set of $FBG_2$ to carry its flows, as long as the
above rule is maintained for each $FBG_2$.

Unfortunately the above rule still allows a vast 
amount of legal tenant-placement and link-allocation 
possibilities, which make the full 3-level fat-tree LaaS 
problem too hard to be solved in practical time even on 
high-end processors. If we were to provide an optimal 
allocation we would conclude here that our problem is 
too hard. But our task is not to find the optimal solu¬ 
tion, or even any solution at a specific iteration. Our 
target is to show that there is a simple enough algorithm 
that would be able to handle the online LaaS problem in 
reasonable time and with reasonable success rate such 
that the cluster utilization remains high and LaaS is 
guaranteed. We do that by applying a restriction on 
the solution space of the hierarchical decomposition. 

Approximated algorithm. We provide a simpler algorithm
that compromises cluster utilization in favor of
reduction of the solution search space. Our approximation
requires the allocation to be symmetrical with
respect to all the $FBG_2$, i.e. that the allocation on all
the $FBG_2$ is identical and thus calculated just once.

Figure 13: A 3-level fat-tree showing the host allocation
on each 2-level sub-tree matching Theorem 1.
One of the $lvl_2$-$lvl_3$ Full Bipartite Graphs
($FBG_2$) is highlighted. We denote as $U_j$ the maximal
number of flows injected into this $FBG_2$
from the 2-level sub-tree.

So the solution must use the same number of flows $U_j$
leaving any one of the $lvl_2$ switches in the same 2-level
sub-tree. Note that any allocation where the number
of tenant hosts $N_i$ connected to leaf switch $i$ does not
include all the hosts on that leaf switch, $N_i < m_1$, will
not utilize all the links from that switch to the upper-level
switches. So only a subset of the $lvl_2$ switches in
the same $FBG_1$ is going to pass traffic of that tenant.
Thus if we now consider the $lvl_2$ to $lvl_3$ traffic, not all
$FBG_2$ will see the same $U_j$. To avoid this we require
that $D$ is either 0 or $m_1$ for all 2-level sub-trees, except
where the tenant fits within the same 2-level fat-tree
and thus $U_j = 0$. As a consequence, if a tenant cannot
fit within a single sub-tree, we round up its size
to a multiple of $m_1$. The host placement can now be
performed in complete leaf switches of $m_1$ hosts. For
instance, if each leaf switch can hold 10 hosts, and a tenant
requests $N = 267$ hosts, then we effectively allocate
it $N' = m_1 \lceil N/m_1 \rceil = 270$ hosts.

Moreover, since the approximation in 3-level fat-trees
allocates complete $lvl_1$ switches, it is equivalent to
the 2-level LaaS problem: $lvl_1$ switches are equivalent
to hosts, $lvl_2$ switches are like leaf switches and $lvl_3$
switches are like spines. Thus the approximated 3-level
fat-tree LaaS problem has to comply with the same
conditions as for the 2-level tree. We denote the allocation
of full $lvl_1$ switches using a notation similar to the
2-level one: $Q'$ is the number of allocated 2-level sub-trees,
each with $D' = Q$ leaves. Optionally there may be
one additional 2-level sub-tree with $R'$ allocated leaves.
$N' = \lceil N/m_1 \rceil = Q' \cdot D' + R'$.
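
The rounding step and the resulting leaf-level decomposition can be sketched as below. This is an illustration under stated assumptions: the function name `approx_request` is hypothetical, the parameter `m2` (leaves per 2-level sub-tree) is used only to bound $D'$, and the values of `m2` in the examples are assumed rather than taken from the paper.

```python
import math

def approx_request(n_hosts, m1, m2):
    """Round a request up to whole leaves and list (Q', D', R') options.

    m1: hosts per leaf; m2: leaves per 2-level sub-tree (assumed bound on D').
    Returns (effective_hosts, options) where options are leaf-level
    decompositions leaves = Q'*D' + R' with R' < D' <= m2.
    """
    leaves = math.ceil(n_hosts / m1)            # number of complete leaves
    options = []
    for d in range(min(leaves, m2), 0, -1):
        q, r = divmod(leaves, d)
        options.append((q, d, r))
    return leaves * m1, options

# The example from the text: 267 hosts on leaves of 10 hosts.
rounded, _ = approx_request(267, 10, 18)
# A tenant of 32 hosts with 4 hosts per leaf occupies 8 complete leaves.
hosts_32, options_32 = approx_request(32, 4, 4)
```

For 32 hosts and $m_1 = 4$, one of the listed options is $Q' = 2$, $D' = 3$, $R' = 2$; which option is actually used depends on the free leaves and links at allocation time.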

An example of such allocation for a tenant of 32 hosts
on a 3-level fat-tree, with $m_1 = 4$ hosts per leaf, is
provided in Fig. 14. On the left $Q' = 2$ sub-trees, the
tenant uses $D' = 3$ leaves and thus $U_1 = U_2 = 3$ for
all $FBG_2$. In addition a single unique sub-tree $r$ with
$R' = 2$ leaves is also allocated and thus $U_r = 2$ for
all $FBG_2$. So all the $FBG_2$ are identical. Each


Figure 14: An example of host placement with
$N = 32$ hosts on a 3-level fat-tree using the
Approximated method. Using a notation similar to
the 2-level fat-tree, this allocation is of the form:
$Q' = 2$, $D' = 3$ and $R' = 2$.

Algorithm 1 FLAP(D, Q, R, l, l_e, r, {ports}, {rl})

 1: // find next Q size leaf
 2: for i = l to l_e do
 3:   if |M[i]| >= Q then
 4:     {nPorts} = {ports} ∩ M[i]
 5:     if |nPorts| >= Q then
 6:       {newRL} = {rl} ∪ i
 7:       if r = D then
 8:         // found all repeated leaves
 9:         if findUniqueLeaf(R, l_e, {rl}) then
10:           {Dports} = {nPorts}
11:           {Dl} = {newRL}
12:           return true
13:         end if
14:       else
15:         l' = i + 1; s = r + 1
16:         if FLAP(D, Q, R, l', l_e, s, {nPorts}, {newRL}) then
17:           return true
18:         end if
19:       end if
20:     end if
21:   end if
22: end for
23: return false



one of them has to support $Q'$ $lvl_2$ switches of $D' = 3$
flows and one $lvl_2$ switch with $R' = 2$ flows. These
requirements meet the condition of Theorem 1 and thus
may be feasible.

LaaS algorithm. 

We now want to implement our final LaaS algorithm
for concurrent host placement and link allocation in
fat-trees. To do so, we rely on our Approximated approach,
and track the allocated up-links in a matrix similar to
Fig. 15(a). The required set of leaves and links is of
the form $N = Q \cdot D + R$. As described in the sub-section
'LaaS placement analysis', in a general fat-tree,
this translates to $R$ spines that connect to all the $Q + 1$
allocated leaves and $D - R$ spines connected just to the
$Q$ repeated leaves. These requirements are equivalent
to finding a set of $Q$ leaves that have $D$ free up-ports to
a common set of spines, and a single leaf that has only
$R$ free up-ports that form a subset of the spines used
by the previous $Q$ leaves.
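
The search just described can be sketched as a small backtracking routine. This is not the FLAP listing of Algorithm 1 but a simplified stand-in written for illustration: the name `find_allocation`, the dictionary `free` (leaf index to the set of spines reachable over a free up-link), and the tie-breaking by lowest index are all assumptions. The sketch uses the notation of the analysis ($Q$ leaves, $D$ common spines, $R$ spines for the remainder leaf) and assumes that free hosts are checked elsewhere.

```python
def find_allocation(free, q, d, r):
    """Find q leaves sharing d free spines, plus one leaf with r of them.

    free: dict leaf -> set of spines with a free up-link from that leaf.
    Returns (leaves, spines, rem_leaf, rem_spines) or None. Requires q >= 1.
    """
    leaves = sorted(free)

    def finish(chosen, common):
        if r == 0:
            return chosen, set(sorted(common)[:d]), None, set()
        for leaf in leaves:
            if leaf in chosen:
                continue
            usable = sorted(free[leaf] & common)
            if len(usable) >= r:
                rem = set(usable[:r])
                rest = [s for s in sorted(common) if s not in rem]
                # Keep the r shared spines and add d - r others from common.
                return chosen, rem | set(rest[:d - r]), leaf, rem
        return None

    def search(start, chosen, common):
        if len(chosen) == q:
            return finish(chosen, common)
        for idx in range(start, len(leaves)):
            leaf = leaves[idx]
            narrowed = set(free[leaf]) if common is None else common & free[leaf]
            if len(narrowed) >= d:              # still enough common spines
                found = search(idx + 1, chosen + [leaf], narrowed)
                if found is not None:
                    return found
        return None

    return search(0, [], None)

free_ports = {0: {0, 1, 2, 3}, 1: {1, 2, 3}, 2: {0, 1, 2, 3}, 3: {2, 3}}
found = find_allocation(free_ports, 2, 4, 2)
# Free hosts exist on every leaf, yet no two leaves share a free spine.
blocked = find_allocation({0: {0}, 1: {1}, 2: {2}}, 3, 1, 0)
```

The second call returns no allocation even though each leaf has a free up-link, which is the situation of Lemma 4: a placement can satisfy Theorem 1 and still admit no link allocation.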

The search for Q leaves with enough common spines 
is performed recursively. In the worst case, it may re- 

Algorithm 2 LAAS(N)

 1: // Try 1 level allocation
 2: if N <= m_1 then
 3:   for l = 0 to m_2 · m_3 - 1 do
 4:     if FLAP(N, 1, 0, l, l, 0, {}, {}) then
 5:       return true
 6:     end if
 7:   end for
 8: end if
 9: // Try 2 level allocation
10: if N <= m_1 · m_2 then
11:   for D = max(N, m_1) to 1 do
12:     Q = floor(N / D)
13:     R = N - Q · D
14:     for l = 0 to m_3 - 1 do
15:       if FLAP(D, Q, R, l · m_2, (l + 1) · m_2 - 1, 0, {}, {}) then
16:         return true
17:       end if
18:     end for
19:   end for
20: end if
21: // Try 3 level allocation
22: U = ceil(N / m_1)
23: for D = max(U, m_2) to 1 do
24:   Q = floor(U / D)
25:   R = U - Q · D
26:   if Q <= m_3 then
27:     if FLAP2(D, Q, R, 0, m_3 - 1, 0, {}, {}) then
28:       return true
29:     end if
30:   end if
31: end for
32: return false



allocation if needed. 

In the following we describe the algorithm
for mapping free leaves. The algorithm to perform the
above example is provided in Algorithm 1. The recursive
function assumes the availability of a matrix $M[l]$
of free ports on each leaf switch. It is given the following
constants: $D$, $R$, $Q$ and the start and end leaf switch
indexes $l_s$, $l_e$. The recursive function provides its current
state on the recursion using the following variables: $l$
represents the current leaf index to examine, $r$ the number
of $Q$ size leaves that were already found, {ports} the
set of ports that are possible for this allocation, {rl}
the set of $Q$ size leaves collected so far. Eventually
the recursion provides the following results: {Dl}, the set
of leaves with $Q$ hosts; {Dports}, the set of ports to
be used by the $Q$ size leaves; $U_l$, the unique leaf of size $R$;
and {Uports}, the ports on that leaf. The higher
level algorithm considering the possible valid combinations
of $Q$, $D$ and $R$, for 2-level and 3-level fat-trees, is
provided in Algorithm 2.

Extension for over-subscribed fat trees. In order
to reduce the network equipment cost, some cloud
vendors use over-subscribed fat-trees, also known as
slimmed fat-trees [65]. In an over-subscribed fat-tree,
the number of uplinks is smaller than the number of
downlinks in the switches, contrary to the full bisectional
bandwidth fat-tree, where they are equal. (We
assume equal-bandwidth links.) In such trees, we denote
by $O_i$ the ratio between two total numbers of links:
those connecting switches at level $i$ to the previous level
$i - 1$, and those connecting to the next level $i + 1$. By
this definition, for XGFT:

$O_i = \frac{m_i}{w_{i+1}}$ (5)


Figure 15: Example of allocation with 2 potential
placements. (a) Table of leaf up-links holding
the link assignments of tenants A and B, as
well as 2 faulty links X. (b) Corresponding topology.
The new tenant C of 10 hosts, arranged as
$Q \cdot D + R = 2 \cdot 4 + 2$, can be assigned one of two
allocations. In (a), the first link allocation is shown
in solid, and the second with slanted lines.

quire examining all combinations. Our LaaS al¬ 
gorithm returns the first successful allocation, so try¬ 
ing the most-used leaves first packs the allocations and 
achieves the best overall utilization results. 

Fig. 15 demonstrates the process of evaluating a specific
$D$, $Q$, $R$ division. Consider a new tenant C of 10
hosts, arranged as 2 leaves of 4 hosts plus 1 leaf of 2
hosts. We show 2 possible placements: The first would
use 4 hosts on leaves 4 and 5, and 2 hosts on another
leaf 6. The second would use 4 hosts on leaves 3 and 4,
and 2 hosts on another leaf 2. We also illustrate how
we could take into account two faulty links in our link


We describe here how to provide LaaS for over-subscribed
fat-trees, without requiring hardware-assisted
accurate TDMA link sharing. For simplicity we
do not support tenant selection of their requested bandwidth.
Since we allow no link-sharing between tenants,
and we have no preference between tenants, a tenant
placed across a level $i$ of the tree has at least $O_i$
permutation flows shared on each link. So for crossing level $i$
we only require $S$ common switches at level $i + 1$:

$S = \lceil D / O_i \rceil$ (6)



Clearly, a selection of $D$ such that it is not divisible
by $\lceil O_i \rceil$ reduces the cluster utilization, so the order by
which we search for sub-trees should reflect that priority.
The changes to Algorithm 1 are a new function argument
$S$ which defines the number of spines required,
and its usage in line 7: if r = S then. The changes to
Algorithm 2 involve adding an $S$ of Equation (6) to the
calls of FLAP and also adding an external loop around
the for statements in lines 11-19 and 23-31 to try
$D$ values divisible by $\lceil O_i \rceil$ first.

6. EVALUATION 

Our evaluation is reported in three sub-sections. The 
first deals with the resulting cloud utilization when ap¬ 
plying LaaS conditions. It shows that our LaaS al¬ 
gorithm reaches a reasonable cloud utilization, within 
about 10% of bare-metal allocation. The second part 
describes the system implementation on top of Open- 
Stack, and the third part shows how the LaaS archi¬ 
tecture improves the performance of a tenant in the 
presence of other tenants by completely isolating the 
tenants from each other. 

6.1 Evaluation of Cloud Utilization 

Cloud utilization. We want to study whether our 
LaaS network isolation constraints significantly reduce 
the number of hosts that can be allocated to tenants. 
We define the cloud utilization as the average percent¬ 
age of allocated hosts in steady state. Assuming that 
tenants pay a fee proportional to the number of used 
hosts and the time used, the cloud utilization is a di¬ 
rect measure of the revenue of the cloud provider. 
Scheduling simulator. To evaluate the different 
heuristics on large-scale clouds, we developed a schedul¬ 
ing simulator that runs many tenant requests over a 
user-defined topology. The simulator is configured to 
run any of the above algorithms for host and link al¬ 
location. This algorithm may succeed and place the 
tenant, or fail. We use a strict FIFO scheduling, i.e. 
when a tenant fails, it blocks the entire queue of up¬ 
coming tenants. Note that this blocking assumption 
forms an extremely conservative approach in terms of 
cloud utilization. In practice, clouds would typically 
not allow a single tenant to block the entire queue and 
use resource reservation with back-filling techniques to 
overcome such cases. Since smaller tenants are easier 
to place, for any tenant size distribution, not letting 
smaller tenants bypass those waiting means that we fill 
fewer tenants into the cloud. Thus, the result should 
be regarded as an intuitive lower-bound for a real-life 
cloud utilization. 
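
To illustrate why strict FIFO admission is conservative, the following sketch simulates a queue in which a tenant that does not fit blocks everything behind it. It is a minimal stand-in, not the simulator used for the evaluation: the name `simulate_fifo`, the Unconstrained host-count admission rule, and the assumption that all requests are queued at time 0 are choices made for this example only.

```python
import heapq

def simulate_fifo(requests, total_hosts):
    """Strict FIFO admission with Unconstrained placement (host counts only).

    requests: list of (size, run_time) in arrival order, all queued at time 0.
    Returns the time-averaged fraction of busy hosts, measured from time 0
    until the last tenant is admitted (i.e. before the cloud drains).
    """
    now, free, busy_area = 0.0, total_hosts, 0.0
    running = []                                # heap of (finish_time, size)
    for size, run_time in requests:
        if size > total_hosts:
            raise ValueError("request can never fit")
        # Head-of-line blocking: wait for departures until the head fits.
        while size > free:
            finish, done = heapq.heappop(running)
            busy_area += (total_hosts - free) * (finish - now)
            now, free = finish, free + done
        heapq.heappush(running, (now + run_time, size))
        free -= size
    return busy_area / (now * total_hosts) if now > 0 else 0.0

# 10 hosts; the second tenant (6 hosts) blocks the third (4 hosts),
# although the third would have fit next to the first at time 0.
utilization = simulate_fifo([(6, 10), (6, 5), (4, 5)], 10)
```

In this toy run only 6 of the 10 hosts are busy while the queue is blocked, whereas a back-filling scheduler would have started the 4-host tenant immediately.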

Simulation settings. We simulate the scheduler
with the LaaS algorithm on the largest full-bisectional-bandwidth
3-level fat-tree network that can be built
with 36-port switches, i.e. a cloud of 11,664 hosts
(36 sub-trees of 324 hosts each). The
evaluation uses a randomized sequence of 10,000 tenant
requests. A random run-time in the range of 20 to
3,000 time units is assigned to each tenant. The variation
of run-time makes scheduling harder as it increases
fragmentation.

We evaluate 4 distribution types for the number of 
hosts requested by incoming tenants. First, we ran¬ 
domly generate sizes according to a job size distribution 
extracted from the Julich JUROPA job scheduler traces. 



Figure 16: Utilization is measured after the first 
tenant cannot enter the cloud and before the 
cloud starts draining out of tenants. 

These previously-unpublished traces represent 1.5 years
of activity (Jan. 2010 - June 2011) of Julich JUROPA, a
large high-performance scientific-computing cloud. Second,
we use a truncated exponential distribution of variable
average $x$. It is truncated between 1 and the cluster
size. Then, we evaluate a truncated Gaussian distribution
of average parameter $x$ and standard deviation
parameter $x/5$. Last, we evaluate a uniform distribution
of tenant sizes with a variable average $x$ and range of
$[0.2x, 1.8x]$.
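
The three synthetic size distributions can be sampled as sketched below. The text does not state how the truncation is performed; rejection sampling and rounding to whole hosts are assumptions of this sketch, as are the function names `truncated` and `tenant_size`.

```python
import random

def truncated(sample, low, high, rng):
    """Draw integer values from sample(rng) until one falls in [low, high]."""
    while True:
        value = round(sample(rng))
        if low <= value <= high:
            return value

def tenant_size(kind, x, cluster, rng):
    """Sample a tenant size (in hosts) with average parameter x."""
    if kind == "exponential":
        return truncated(lambda r: r.expovariate(1.0 / x), 1, cluster, rng)
    if kind == "gaussian":
        # Standard deviation x/5, as in the Gaussian experiments.
        return truncated(lambda r: r.gauss(x, x / 5), 1, cluster, rng)
    if kind == "uniform":
        return truncated(lambda r: r.uniform(0.2 * x, 1.8 * x), 1, cluster, rng)
    raise ValueError(kind)

rng = random.Random(0)                  # fixed seed for repeatability
sizes = {kind: [tenant_size(kind, 100, 11664, rng) for _ in range(200)]
         for kind in ("exponential", "gaussian", "uniform")}
```

Because of the truncation at 1 host, the empirical mean of the exponential samples is only close to, not equal to, the parameter $x$.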

As a baseline algorithm, we implement an Uncon¬ 
strained placement approach that simply allocates un¬ 
used hosts to the request, as in bare-metal allocation. 
Note that some requests may still fail if the tenant re¬ 
quests more hosts than the number of currently-free 
cloud hosts. We compare this baseline to the Simple 
and LaaS algorithms, as described in Section 5. 
Simulation results. Fig. 17(a) illustrates the Cumu¬ 
lative Distribution Function (CDF) of the tenant sizes 
(in number of hosts) collected from the Julich JUROPA 
cluster. The CDF shows peaks for numbers of hosts 
that are powers of 2 (1, 2, 4, 8, 16, and 32). We further 
generated 10,000 tenants with this job-size probability 
distribution, and the same random run-time distribu¬ 
tion as above (instead of the original run-times, since 
they resulted in a low load, and therefore in an easy 
allocation). Fig. 17(b) shows the tenant allocation re¬ 
sults: the cost of our LaaS allocation versus the Uncon¬ 
strained bare-metal provisioning is about 10% of cloud 
utilization (88% vs. 98%). 

To further test the sensitivity of our algorithm to the
tenant sizes, we use a truncated exponential distribution
for tenant host sizes and modify the exponential
parameter $x$. The distribution of the JUROPA tenant
sizes is similar to such a truncated exponential distribution.
Fig. 18 illustrates the cloud utilization for Unconstrained,
Simple, and LaaS, plotted as a function
of the exponential parameter $x$, which is close to the
average tenant host size due to the truncation. The
Unconstrained line shows how the utilization degrades
with the job size, even without any network isolation.

Figure 17: (a) Measured job-size Cumulative
Distribution Function (CDF) for the Julich JUROPA
scientific-computing cloud. (b) Resulting
cloud utilization. LaaS achieves 88%.



[Plot x-axis: average tenant size x [hosts], ticks at 10, 100, 1000]

Figure 18: Cloud utilization for a truncated
exponential distribution of tenant host sizes in a
cloud of 11,664 hosts.

This is an expected behavior of bin packing. As the job
size grows, so does the probability for more nodes to be
left unassigned when the cloud is almost full. The
utilization of our LaaS algorithm stays steadily at about
10% less than the Unconstrained algorithm. Finally,
Simple has the lowest cloud utilization for the entire 
tenant size range. Note that it is less steady, since its 
utilization is more closely tied to the sizes of the leaves 
and sub-trees. Once the tenant size crosses the leaf size 
(18 in our case), it is rounded up to a multiple of that 
number. Likewise, once it crosses the size of a complete 
sub-tree (324 hosts), it is rounded up to the nearest 
multiple of that number. These results show that our 
LaaS algorithm provides an efficient solution for avoid¬ 
ing tenant variability, as its cost is only about 10% for 
a wide range of tenant sizes. 
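
The rounding behavior of Simple described above can be illustrated with a short sketch. It is not taken from the released code: the function name simple_reserved_hosts is hypothetical, and the sketch assumes that a tenant that fits within a single leaf is not rounded at all.

```python
import math

LEAF_SIZE = 18       # hosts per leaf switch in the evaluated fat-tree
SUBTREE_SIZE = 324   # hosts per complete sub-tree (18 leaves x 18 hosts)

def simple_reserved_hosts(tenant_size, leaf=LEAF_SIZE, subtree=SUBTREE_SIZE):
    """Hosts reserved by a Simple-style allocation (hypothetical sketch)."""
    if tenant_size <= leaf:
        # Assumption: a tenant that fits in one leaf is not rounded.
        return tenant_size
    if tenant_size <= subtree:
        # Past the leaf size: round up to a multiple of the leaf size.
        return math.ceil(tenant_size / leaf) * leaf
    # Past the sub-tree size: round up to a multiple of the sub-tree size.
    return math.ceil(tenant_size / subtree) * subtree

for n in (10, 19, 324, 325):
    reserved = simple_reserved_hosts(n)
    print(n, reserved, "efficiency = %.2f" % (n / reserved))
```

The sharp efficiency drop just past 18 and 324 hosts is what makes the Simple curve less steady than the LaaS curve.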

Fig. 19 illustrates the cloud utilization for the trun¬ 
cated Gaussian distribution. This distribution provides 
a harder test for the allocation algorithm, since tenant 
sizes are made similar, and they may be just beyond the 
above-mentioned thresholds of a leaf size (18 hosts) or 
a sub-tree size (324 hosts). These thresholds are where 
LaaS and Simple are less efficient when compared
to Unconstrained. 

Simple suffers from a particularly large fluctuation in 
utilization. LaaS is more stable over the entire range, 
with about 90% utilization. There are a few points 



Figure 19: Cloud utilization for a truncated
Gaussian distribution $\mathcal{N}(x, x/5)$ of tenant host
sizes in a cloud of 11,662 hosts.



Figure 20: Cloud utilization for a truncated
Gaussian distribution $\mathcal{N}(x, \sigma)$ of tenant host sizes
in a cloud of 11,662 hosts, for $\sigma \in \{0, x/10, x/5\}$.

where the Simple heuristic provides a better utilization 
than LaaS. But, note that utilization stability is key 
to cloud vendors, since changing the allocation algo¬ 
rithm dynamically would require predicting the future 
size distribution, and thus may produce worse results 
when the distribution does not behave as expected. 

Fig. 20 plots the LaaS Approximation utilization for 
different spreads of tenant sizes around the average. A 
standard deviation of avg/5, avg/10 and 0 are shown. 
The zero deviation curve exhibits the expected saw¬ 
tooth shape that is caused by the fact it is possible 
to get 100% utilization when the tenant size is a divisor 
of the number of nodes. As the deviation of the tenant 
sizes grows, so does the smoothness of the curve. This
behavior is common to all scheduling algorithms: the
peaks and valleys appear around the job sizes that cross
a single leaf or sub-tree size.
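
The saw-tooth shape of the zero-deviation curve follows from simple arithmetic: with identical tenants of size s on N hosts, at most floor(N/s) tenants fit. The sketch below is only an illustration under that assumption; it considers the host count alone (no link constraints), uses N = 11,664 (18 x 18 x 36), and the function name max_utilization is hypothetical.

```python
def max_utilization(total_hosts, tenant_size):
    """Upper bound on utilization when all tenants have the same size."""
    tenants_that_fit = total_hosts // tenant_size
    return tenants_that_fit * tenant_size / total_hosts

N = 11664  # 18 * 18 * 36 hosts
for s in (18, 324, 1000, 5000, 5832, 6000):
    # Sizes that divide N reach 1.0; sizes just above a divisor drop sharply.
    print(s, round(max_utilization(N, s), 3))
```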

The utilization obtained for the uniform distribution of
tenant sizes is presented in Fig. 21. As can be seen, there
is a clear advantage to the LaaS Placement heuristic, which
maintains a utilization of about 90%.

LaaS also provides the opportunity to turn off unused 

Figure 21: Cloud utilization vs. average tenant
size for 10,000 requests with Uniform(0.2x, 1.8x)
size distribution.



Figure 22: Percentage of links that can be 
turned off in the 3-level fat-tree as a function 
of the cloud utilization. 

links that are not allocated to tenants. Fig. 22 provides 
the percentage of links that could be turned off, for 
the LaaS scheduling of the Julich distribution of tenant 
sizes. As can be observed, the average percentage of 
links that could be turned off is linear with the cloud 
utilization. As the utilization decreases the number of 
unused links grows accordingly and the network power 
can be linearly reduced. 

6.2 System Implementation 

We implemented the LaaS architecture by extending 
the OpenStack Nova scheduler with a new service that 
first runs the LaaS host and link allocation algorithm, 
and then translates the resulting allocation to an SDN 
controller that enforces the link isolation via routing 
assignments. 

Host and link allocation. The integration of the 
LaaS algorithm was done on top of OpenStack (Icehouse
release), utilizing the filter type AggregateMultiTenancyIsolation.
This filter allows limiting tenant placement
to a group of hosts declared as an “aggregate”, which 
is allocated to the specific tenant-id. Our automation, 
provided as a standalone service on top of OpenStack’s 
nova controller, obtains new tenant requests, and then 

Figure 23: Average run-time of single tenant allocation
versus average tenant size.

calls the LaaS allocation algorithm. If the allocation 
succeeds, we invoke the command to create a new ag¬ 
gregate that is further marked by the tenant-id. The 
allocated hosts are then added to the aggregate. The 
filter guarantees that a new host request, conducted by 
a user that belongs to a specific tenant, is mapped to a 
host that belongs to the tenant aggregate. 
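
The aggregate bookkeeping described above can be sketched as follows. The helper name aggregate_commands is hypothetical, and the sketch assumes that the Keystone tenant id is already known. The nova aggregate-* command lines mirror those in the generated command file shown in Appendix A.5.2; the sketch only builds the command strings and does not execute them.

```python
def aggregate_commands(tenant, keystone_tenant_id, hosts):
    """Build the nova CLI lines that pin `hosts` to one tenant (sketch)."""
    aggr = "laas-aggr-%s" % tenant
    cmds = [
        # Create the aggregate and mark it with the owning tenant id,
        # which the AggregateMultiTenancyIsolation filter inspects.
        "nova aggregate-create %s" % aggr,
        "nova aggregate-set-metadata %s filter_tenant_id=%s"
        % (aggr, keystone_tenant_id),
    ]
    # Add every allocated host to the tenant aggregate.
    cmds += ["nova aggregate-add-host %s %s" % (aggr, h) for h in hosts]
    return cmds

for line in aggregate_commands(4, "<keystone-id>", ["comp-1", "comp-2"]):
    print(line)
```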

Network controller. We further implement a method 
to provide the link allocation to the InfiniBand SDN 
controller [67], which allows it to enforce the isolation by 
changing routing. The controller supports defining sub¬ 
topologies, by providing a file with a list of the switch 
ports and hosts that form each sub-topology. Then each 
sub-topology may have its own policy file that deter¬ 
mines how it is routed. We ran the SDN controller over 
the simulated network of 1,728 hosts, as well as over our 
32-host experimental cluster. 

Run-time. The LaaS Approximation scans through 
all possible placements for valid link allocation. This 
involves evaluating all possible valid combinations of 
R and Q values. Fig. 23 presents the average run-time
per tenant request for placing tenants on a cluster of
11,664 nodes, with a truncated exponential tenant
size distribution. Run time was measured on an Intel®
Xeon® CPU X5670 @ 2.93GHz. The peak in run-time 
of about 5 msec appears just below the average tenant 
size of 324, which is the exact point where our algorithm 
first scans all possible placements under a single sub¬ 
tree and continues with multiple sub-tree placement. 
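
The 324-host threshold is the size of one complete sub-tree. As a hedged illustration, the host and switch counts of the XGFT(3; 18,18,36; 1,18,18) topology used in Appendix A.6 can be computed as below. The function name xgft_sizes is hypothetical, and the counting formulas are the standard ones for XGFT(h; m1..mh; w1..wh), not code from the release.

```python
from math import prod  # Python 3.8+

def xgft_sizes(m, w):
    """Return (hosts, [switches at level 1..h]) for XGFT(h; m; w)."""
    h = len(m)
    hosts = prod(m)
    # Level l (1..h) holds w1*...*wl * m(l+1)*...*mh switches.
    switches = [prod(w[:l]) * prod(m[l:]) for l in range(1, h + 1)]
    return hosts, switches

m, w = (18, 18, 36), (1, 18, 18)
hosts, switches = xgft_sizes(m, w)
print(hosts, switches)        # total hosts and [L1, L2, L3] switch counts
print(m[0], m[0] * m[1])      # hosts per leaf, hosts per complete sub-tree
```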

6.3 Evaluation of Tenant Performance 

Since LaaS guarantees tenant isolation, tenant per¬ 
formance should be independent of the number of other 
tenants that run on the same network. To demon¬ 
strate LaaS tenant isolation, we simulate a large cluster 
using a well-known InfiniBand flit-level simulator used
by [43,51,68].

Fig. 24 presents the relative performance of single and 
multiple tenants running Stencil scientific-computing 
applications on a cloud of 1,728 hosts, under either
Unconstrained or LaaS, normalized by the performance of
a single tenant placed without constraints. The figure 

Figure 24: Simulated relative performance for
tenants running Stencil scientific-computing ap¬ 
plications on a cloud of 1,728 hosts, either alone 
or as 32 concurrent tenants. While tenant per¬ 
formance degrades when placed unconstrained 
(without link isolation), the performance of sin¬ 
gle and multiple tenants with LaaS appears iden¬ 
tical, fulfilling the promise of LaaS. 

illustrates many effects. First, the performance of a 
single tenant with Unconstrained significantly degrades 
when other tenants are active, e.g. to 45% with 32-KB 
message sizes. This is because the bare-metal allocation 
of Unconstrained does not provide link isolation. Sec¬ 
ond, under our LaaS algorithm, the single-tenant per¬ 
formance is not impacted when the other tenants become 
active (the third and fourth sets of columns look identical).
This was the key goal of this work. LaaS prevents
any inter-tenant traffic contention. Finally, we can ob¬ 
serve an additional surprising effect (first vs. third 
sets of columns): the tenant performance is slightly im¬ 
proved for small messages under LaaS versus the Un¬ 
constrained allocation. The reason is that LaaS does 
not accept tenants unless it can place them with no 
contention, and therefore the resulting placement tends 
to be tighter, thus improving the run-time performance 
with small message sizes when the synchronization time 
of the tasks is not negligible. The lower network diam¬ 
eter of LaaS improves the synchronization time, which 
is latency-dominated. 

7. DISCUSSION 

Recursive LaaS. Industry vendors that we talked to
pointed out simple extensions that would easily
generalize the use of LaaS. First, LaaS could be applied 
recursively, by having each tenant application or each 
sub-tenant reserve its own chunk of the cloud within 
the tenant’s chunk of the cloud. Second, LaaS could 
also be applied in private clouds, with cloud chunks be¬ 
ing reserved by applications instead of tenants. Third, 
shared-cloud vendors could easily restrict LaaS to a sub¬ 
set of their cloud, while keeping the remainder of their 
cloud as it is today. This can be done by reserving 
large portions of the topology to a virtual tenant that 


is shared between many real tenants. Pre-allocation and 
modification of that sub-topology is already supported 
by our code. As a result, LaaS offers a smooth and 
gradual transition to better service guarantees, enabling 
cloud vendors to start only with the tenant owners who 
are most ready to pay for it. 

Off-the-shelf LaaS. LaaS is implementable today with 
no extra hardware cost in existing switches and no host 
changes. The algorithm requires only a moderate soft¬ 
ware change in the allocation scheme, which we provide 
as open source. It also relies on an isolated-routing fea¬ 
ture of the SDN controller, which is already available in 
InfiniBand and could be implemented in Ethernet SDN 
controllers like OpenDaylight. 

Proportional network power. LaaS eases the use 
of an elastic network link power that would be made 
proportional to cloud utilization [69]. This is because 
it explicitly specifies which links and switches are to
be used, and therefore can turn off other links and 
switches. In other approaches the control has to happen 
as a result of traffic load change and thus is not realis¬ 
tic for common switch hardware for which the turn-ON 
time is much larger than a microsecond.

Heterogeneous LaaS. Host allocation in heterogeneous
clouds involves allowing tenants to express their
required host features in terms of CPU, memory, disk 
and available accelerators. On such systems, the host 
allocation algorithm should allow the provider to trade 
off the acceptance of a new tenant versus the cost of 
the available hosts, which may be higher as their capabilities
may exceed the user needs. Although these
requirements complicate the allocation algorithm, it is
feasible to support them in LaaS. First, the algorithm should
use the host costs to order the search. Second, it should try
all the possible divisors and select the one with the best
accumulated cost. A trade-off between the resulting
fragmentation and the cost difference could extend it further.

LaaS with VMs. LaaS could easily support multiple
tenants running as virtual machines (VMs) on the same 
host, assuming accurate packet pacing and burst control 
is provided by hosts and switches. LaaS could then treat 
each link as a set of isolated links and assign them to 
different tenants. This includes the links leaving the 
host. 

Non-FIFO tenant scheduling. We conservatively 
evaluated our LaaS allocation algorithm assuming 
FIFO scheduling of incoming tenants. To improve the 
cloud utilization, we could equally rely on a non-FIFO 
policy, e.g. by using back-filling, reservations, or a 
jointly-optimal allocation of multiple tenants [32].

Fault Tolerance. When a link is down before being
allocated it is easy to avoid allocating it to new tenants. 
However, if a link was already allocated to a tenant, 
it is not always possible to provide an alternative link 
without breaking the current operation of the tenant.
Similarly to losing a link on the private cloud, the tenant 
will see some degradation until the link is fixed or the 
forwarding plane is adapted. 

8. CONCLUSIONS 

In this paper, we demonstrated that the interference 
with other tenants causes a performance degradation in 
cloud applications that may exceed 65%. We introduced 
LaaS (Links as a Service), a novel cloud allocation and 
routing technology that provides each tenant with the 
same bandwidth as in its own private data center. We 
showed that LaaS completely eliminates the application 
performance degradation. We further explained how 
LaaS can be used in clouds today without any change 
of hardware, and showed how it can rely on open-source 
software code that we contributed. Finally, we also used 
previously-unpublished tenant-size statistics of a large 
scientific-computing cloud, obtained over a long period 
of time, to construct a random workload that illustrates 
how isolation is possible at the cost of some 10% cloud 
utilization loss. 

A. LAAS SOFTWARE RELEASE 1.0

The software is provided in [28] under the directory 
laas_1.0/ as well as in a single archive file: laas_1.0.tgz. 
In this section, we provide all the information required 
to get the LaaS service installed, and instructions to run 
a demonstration of the service. We also provide the sim¬ 
ulation setup used for obtaining the cluster utilization, 
run-time and correctness. 

The simulator and a service of LaaS are coded in 
Python and are built on top of the core algorithm coded 
in C++. At the heart of the package is the LaaS algorithm,
coded in isol.cc. It uses facilities specific to
3-level fat-trees provided in ft3.cc, and a port-mask utility
class in portmask.cc that tracks the availability of
links. The file laas.cc implements the service API declared
in laas.h and exposed in Python using SWIG, which uses
the declarations in laas.i. We provide a scheduling simulator,
used to obtain the cluster utilization, in sim.py and a tenant
allocation service in laas_service.py.

The LaaS service provides a RESTful interface and 
serves tenant requests [70]. It outputs OpenStack com¬ 
mand files required to control tenant host placement 
and also provides SDN configuration files to enforce iso¬ 
lation via packet routing/forwarding. 

The scheduling simulator takes a CSV file with tenant 
requests (id, size and arrival time) and processes them in
a FIFO manner. 
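
A minimal sketch of reading such a request file and ordering it for FIFO processing is given here. The column order (id, size, arrival time) follows the description above, but the absence of a header line, the field types, the comment convention, and the names TenantRequest and load_requests are all assumptions; the actual sim.py may differ.

```python
import csv
import io
from collections import namedtuple

TenantRequest = namedtuple("TenantRequest", "tenant_id size arrival")

def load_requests(csv_file):
    """Parse rows of (id, size, arrival time) and return them in FIFO order."""
    requests = []
    for row in csv.reader(csv_file):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comments (assumed convention)
        requests.append(TenantRequest(int(row[0]), int(row[1]), float(row[2])))
    # FIFO: earliest arrival first; ties keep file order (sorted is stable).
    return sorted(requests, key=lambda r: r.arrival)

sample = io.StringIO("1,8,0\n2,3,5\n3,12,2\n")
for req in load_requests(sample):
    print(req)
```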

A.1 License

This software is provided with a choice of GPLv2 or 
BSD license and published on our website. 

A.2 Content 

The following sub-directories are included in this re¬ 
lease: 

• src - The core algorithm C++ and python ser¬ 
vice/simulator 

• bin - Random tenants generator and isol.log 
checker 

• examples - A set of files used by the demo below 

Out of the entire set of source files, the one most 
interesting for integration is laas.h which provides the
API exposed in Python. 

A.3 Software dependency 

• Any Linux environment, for example Ubuntu 12.04 

• SWIG Version 2.0.11 

• Python 2.7 

• Python 2.7 Flask 0.7 

• Python 2.7 Flask Restful 0.3.1 

The Perl code (for utilities only) depends on: 


• Perl V5.18.2 

• Perl Math::Random 0.71 

• Perl Math::Round 0.07 

A.4 Installation 

tar xvfz laas_1.0.tgz 
cd laas_1.0/src 
make 

A.5 Running LaaS Service 

1. Choose your cluster topology: 

For ease of review we choose a small 2 level fat 
tree. The example topology is XGFT(2; 4,8; 1,4). 
Due to a limitation of the current implementation,
we represent it as if it were a 3-level fat tree with
one top switch: XGFT(3; 4,8,1; 1,4,1). The data
needed to run a larger topology is also included in
the examples directory. 

2. Prepare name mapping file: 

The LaaS engine eventually needs to configure 
OpenStack and an SDN controller that rely on 
physical naming and port numbering and not on 
general fat-tree indexing. A file that provides map¬ 
ping of the tree level, index within the tree and port 
indexing to the actual cluster hardware is thus re¬ 
quired. For this example topology we provide the 
mapping file: examples/pgft_m4_8_w1_4.csv.

The first line hints at the content of each column: 

# lvl,swidx,name,UP,upPorts,DN,dnPorts

The example line below describes a host, providing 
its level is 0 and index is 10, its name is comp-11 
and it has a single UP port, number 1, connecting 
to an L1 switch (on level 1).

0,10,comp-11,UP,1,,,,,,,,,,,,,,,,,,,,,,,, 

An example L1 switch line is provided below. This
is the 4th switch in L1, its name as recognized
by the SDN controller is SW_L1_3, and its ports 5-8
are connecting to hosts:

1,3,SW_L1_3,UP,1,2,3,4,DN,5,6,7,8 

Note: The file does not include any mapping for
the non-existent level 3 switches.

3. Start the service: 

Once started, the LaaS service reports its address
and port. The RESTful API is up and any change in
tenant status will result in updates in the OSCfg/
and SDNCfg/ directories.

    $ python ./src/laas_service.py -m 4,8,1 -w 1,4,1 \
        -n examples/pgft_m4_8_w1_4.csv
    -I- Defined 64 up ports and 64 down port mappings
     * Running on http://127.0.0.1:12345/
     * Restarting with reloader
    -I- Defined 64 up ports and 64 down port mappings

4. Run a demo: 

We provide here an example sequence of calls to 
the service. After each step we discuss the results 
and the created files if any. 

A.5.1 List tenants:

    $ curl http://localhost:12345/tenants
    {}

As expected, it returns an empty list.

A.5.2 Create a tenant of 10 nodes: 

(Expecting it will span 2.5 leafs.) 

    $ curl http://localhost:12345/tenants -d "id=4" -d "n=10" -X POST
    {
      "N": 10,
      "hosts": 10,
      "l1Ports": 10,
      "l2Ports": 0
    }

See how the tenant-id may be any number for which 
there is no pre-existing tenant in the system. Let’s 
inspect the created files. First see the new file in
the OSCfg/ directory:

cmd-1.log:

    #!/bin/bash
    #
    # Adding tenant 4 to OpenStack
    #
    echo Adding tenant 4 to OpenStack > OSCfg/cmd-1.log
    keystone tenant-create --name laas-tenant-4 \
        --description "LaaS Tenant 4" >> OSCfg/cmd-1.log
    tenantId=`keystone tenant-get laas-tenant-4 | \
        awk '/ id /{print $4}'` >> OSCfg/cmd-1.log
    nova aggregate-create laas-aggr-4 >> OSCfg/cmd-1.log
    nova aggregate-set-metadata laas-aggr-4 \
        filter_tenant_id=$tenantId >> OSCfg/cmd-1.log
    nova aggregate-add-host laas-aggr-4 comp-1 >> cmd-1.log
    nova aggregate-add-host laas-aggr-4 comp-2 >> cmd-1.log
    nova aggregate-add-host laas-aggr-4 comp-3 >> cmd-1.log
    nova aggregate-add-host laas-aggr-4 comp-4 >> cmd-1.log
    nova aggregate-add-host laas-aggr-4 comp-5 >> cmd-1.log
    nova aggregate-add-host laas-aggr-4 comp-6 >> cmd-1.log
    nova aggregate-add-host laas-aggr-4 comp-7 >> cmd-1.log
    nova aggregate-add-host laas-aggr-4 comp-8 >> cmd-1.log
    nova aggregate-add-host laas-aggr-4 comp-9 >> cmd-1.log
    nova aggregate-add-host laas-aggr-4 comp-10 >> cmd-1.log

Similarly, the SDNCfg/ directory now holds a full 
set of configuration files required for OpenSM to 
configure the network. We will not go through 
the full description of these files but focus on the 
groups.conf. This file now holds the definition of 
the hosts and switch ports used by the first tenant: 

    port-group
    name: T4-hcas
    obj_list:
        name=comp-1/U1:P1
        name=comp-2/U1:P1
        name=comp-3/U1:P1
        name=comp-4/U1:P1
        name=comp-5/U1:P1
        name=comp-6/U1:P1
        name=comp-7/U1:P1
        name=comp-8/U1:P1
        name=comp-9/U1:P1
        name=comp-10/U1:P1;
    end-port-group

    port-group
    name: T4-switches
    obj_list:
        name=SW_L1_2/U1 pmask=0x6
        name=SW_L1_0/U1 pmask=0x1e
        name=SW_L1_1/U1 pmask=0x1e;
    end-port-group
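
The pmask values in the switch port-group appear to be port bitmasks in which bit i stands for port i. This reading is inferred from the example only (the two fully used leaves list four up ports, the third leaf lists two), not from the SDN controller documentation. Under that assumption, and with hypothetical helper names, the conversion can be sketched as:

```python
def ports_to_pmask(ports):
    """Encode a list of port numbers as a bitmask (bit i <-> port i)."""
    mask = 0
    for p in ports:
        mask |= 1 << p
    return mask

def pmask_to_ports(mask):
    """Decode a bitmask back into a sorted list of port numbers."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]

print(hex(ports_to_pmask([1, 2, 3, 4])))  # mask for up ports 1-4
print(pmask_to_ports(0x6))                # ports encoded by pmask=0x6
```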

A.5.3 To fill in the network we create another
10-node tenant:

    $ curl http://localhost:12345/tenants -d "id=1" -d "n=10" -X POST
    {
      "N": 10,
      "hosts": 10,
      "l1Ports": 10,
      "l2Ports": 0
    }

A.5.4 List again the tenants: 

    $ curl http://localhost:12345/tenants
    {
      "1": {
        "N": 10,
        "hosts": 10,
        "l1Ports": 10,
        "l2Ports": 0
      },
      "4": {
        "N": 10,
        "hosts": 10,
        "l1Ports": 10,
        "l2Ports": 0
      }
    }

A.5.5 Get the allocated hosts and links for a 
specific tenant: 

    $ curl http://localhost:12345/tenants/1/hosts
    [
      "comp-11",
      "comp-12",
      "comp-13",
      "comp-14",
      "comp-15",
      "comp-16",
      "comp-17",
      "comp-18",
      "comp-19",
      "comp-20"
    ]

As expected the four spines are going to be used 
(all up ports of 2 leafs) and only 2 ports of the leaf 
SW_L1_2 holding just 2 nodes. 

    $ curl http://localhost:12345/tenants/1/l1Ports
    [
      { "pNum": 3, "sName": "SW_L1_2" },
      { "pNum": 4, "sName": "SW_L1_2" },
      { "pNum": 1, "sName": "SW_L1_3" },
      { "pNum": 2, "sName": "SW_L1_3" },
      { "pNum": 3, "sName": "SW_L1_3" },
      { "pNum": 4, "sName": "SW_L1_3" },
      { "pNum": 1, "sName": "SW_L1_4" },
      { "pNum": 2, "sName": "SW_L1_4" },
      { "pNum": 3, "sName": "SW_L1_4" },
      { "pNum": 4, "sName": "SW_L1_4" }
    ]
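
To read such a reply at a glance, the ports can be grouped per leaf switch. The sketch below works on reply text of the shape shown above; the field names pNum and sName come from that output, while the function name ports_by_switch is hypothetical.

```python
import json
from collections import defaultdict

def ports_by_switch(reply_text):
    """Group the (switch, port) entries of a port-list reply by switch name."""
    grouped = defaultdict(list)
    for entry in json.loads(reply_text):
        grouped[entry["sName"]].append(entry["pNum"])
    return {name: sorted(ports) for name, ports in grouped.items()}

reply = """[
  {"pNum": 3, "sName": "SW_L1_2"}, {"pNum": 4, "sName": "SW_L1_2"},
  {"pNum": 1, "sName": "SW_L1_3"}, {"pNum": 2, "sName": "SW_L1_3"}
]"""
print(ports_by_switch(reply))
```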

A.5.6 A bad request example: 

Now let’s see what happens if we try to over¬ 
provision the cluster by requesting a tenant of 
13 = 32 - 20 + 1 hosts: 

    $ curl http://localhost:12345/tenants -d "id=2" -d "n=13" -X POST
    {
      "message": "Fail to allocate tenant 2"
    }

A.5.7 Delete tenant 1: 

    $ curl http://localhost:12345/tenants/1 -X DELETE

We now have a command file under OSCfg/ that
deletes the OpenStack tenant and aggregate:

    #!/bin/bash
    #
    # Removing tenant 1 from OpenStack
    #
    nova aggregate-remove-host laas-aggr-1 comp-11 >> cmd-3.log
    nova aggregate-remove-host laas-aggr-1 comp-12 >> cmd-3.log
    nova aggregate-remove-host laas-aggr-1 comp-13 >> cmd-3.log
    nova aggregate-remove-host laas-aggr-1 comp-14 >> cmd-3.log
    nova aggregate-remove-host laas-aggr-1 comp-15 >> cmd-3.log
    nova aggregate-remove-host laas-aggr-1 comp-16 >> cmd-3.log
    nova aggregate-remove-host laas-aggr-1 comp-17 >> cmd-3.log
    nova aggregate-remove-host laas-aggr-1 comp-18 >> cmd-3.log
    nova aggregate-remove-host laas-aggr-1 comp-19 >> cmd-3.log
    nova aggregate-remove-host laas-aggr-1 comp-20 >> cmd-3.log
    nova aggregate-delete laas-aggr-1 >> OSCfg/cmd-3.log
    keystone tenant-delete laas-tenant-1 >> OSCfg/cmd-3.log


A.5.8 Retry allocating the 13 nodes tenant:

    $ curl http://localhost:12345/tenants \
        -d "id=2" -d "n=13" -X POST
    {
      "N": 13,
      "hosts": 13,
      "l1Ports": 13,
      "l2Ports": 0
    }

A.6 Running Simulation of LaaS algorithm

In this section we provide instructions for the simulation
of a LaaS engine handling a large number of tenant
requests. The procedure provided here is similar to the
one used to obtain the results in the paper. In the
paper, we also used a scheduler that implements the
Simple and the Unconstrained algorithms.

1. Choose your cluster topology:

For example the maximal full bisection 3 level
XGFT with 36 port switches is: XGFT(3; 18,18,36;
1,18,18). It has 11,664 hosts, 648 L1, 648 L2 and 324
L3 switches.

2. Generate a set of tenant requests:

We do that by running the utility
bin/genJobsFlow. For this example we use
an exponential distribution with an average of
8 hosts. The tenant run time is uniformly distributed
in the range [20,3000]. Please try -help
to see other possible options.

    $ ./bin/genJobsFlow -n 10000 -s 8 -r 20:3000 -a 0 > \
        examples/exp=8_tenants=1000_arrival=0.csv

3. Run the simulator:

After we have prepared the tenant requests file and
decided about the topology we can run:

    $ python ./src/sim.py -m 18,18,36 -w 1,18,18 \
        -c examples/exp=8_tenants=1000_arrival=0.csv
    -I- Obtained 10000 jobs
    -I- first waiting job at: 20 lastJobPlacementTime 10623
    -I- Total potential hosts * time = 1.23673e+08
    -I- Total considered jobs: 9976 skip first: 0 last: 24
    -I- Total actual hosts * time = 1.17281e+08
    -I- Host Utilization = 94.83 %
    -I- L1 Up Links Utilization = 38.36 %
    -I- L2 Up Links Utilization = 10.70 %
    -I- Total Links Utilization = 48.40 %
    -I- Run Time = 14.2 sec

The details of each allocation/deallocation are provided
in the log file: isol.log. Each line describes
one transaction and contains the total hosts/links
as well as their detailed indices within the topology.

4. Check that the results are legal:

The checker needs to know the topology size, so it
requires this information on the command line:

    checkAllocations -n/--hosts-per-leaf n
                     -k/--num-l1-per-l2 m2
                     -1/--total-l1s t1
                     -2/--total-l2s t2
                     -3/--total-l3s t3
                     -l/--log log-file

    $ ./bin/checkAllocations -n 18 -k 18 -1 648 -2 648 -3 324 \
        -l isol.log
    -I- Checked 10000 ADD and 8760 REM jobs
    -I- Added/Rem 35573/30117 L1PORTS and 567/482 L2PORTS

B. EXPERIMENTAL SETUP

B.1 Hardware

The experiment was run on the 32-node cluster pre¬ 
sented in Fig. 2. The hosts are of two types: 

• 30 hosts are HP ProLiant DL320e G8 E3-1220v2
B120i 2x1Gb 1x8GB 1x500GB HOT PLUG DVD-RW
350W 3Y, each containing a 4-core Intel Xeon
CPU E3-1220 V2 at 3.10GHz.

• 2 hosts are IBM System x3450 servers featuring 
Intel Xeon processors 2.80 GHz and 3.0 GHz/1600 
MHz, with 12 MB L2, and 3.4 GHz/1600 MHz, 
with 6 MB L2. 

The InfiniBand NICs are: MHQH19-XTC Single 4X
IB QDR Port, PCIe Gen2 x8, Tall Bracket, RoHS-R5
HCA Card, QSFP Connector. The InfiniBand switches
are: MIS5024Q-1BFR 36-port non-blocking 40Gb/s
unmanaged Switch System.

B.2 Software

The machines run Scientific Linux release 6.5 (Carbon).
The MPI used is mvapich2-2.0rc2. Our experiment
uses a simple MPI program that executes
either MPI_AllToAll collectives or 2-dimensional stencil
communications using ISend/IRecv followed by
MPI_Barrier. The programs are provided under the
sub-directory mpi_experiment. This directory also
holds the RUN script that was used to invoke each of
the 4 tenants' MPI applications with a delay after invoking
the previous one. The host files used are also included.

C. ETHERNET SIMULATIONS 

For the simulation of an Ethernet-based topology
we used an enhanced iNET framework. We
base our code on iNET 2.2 and extend it with
DCTCP modules. The switch forwarding is
also enhanced with ECMP-like forwarding, using a
hash function that computes modulo(SrcHostIndex +
DstHostIndex, NumberOfUpPorts). The parameters
used by the simulator are described in Table 1.
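
The forwarding rule can be written in a few lines. This is an illustrative sketch rather than the simulator's module code: the function name select_up_port is hypothetical, and zero-based up-port indices are assumed.

```python
def select_up_port(src_host_index, dst_host_index, number_of_up_ports):
    """ECMP-like choice: hash the host pair onto one of the up ports."""
    return (src_host_index + dst_host_index) % number_of_up_ports

for src, dst in [(0, 5), (1, 5), (7, 3), (3, 7)]:
    print(src, dst, select_up_port(src, dst, 4))
```

The choice is deterministic per host pair and symmetric in source and destination, so both directions of a flow map to the same up-port index.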

Parameter                                    Value               Description
-------------------------------------------  ------------------  ---------------------------------------------
MACRelayUnitPP.bufferSize                    65,536              Per port buffer size, meaning total buffer size = bufferSize*numRealPorts
MACRelayUnitPP.processingTime                3.00E-07            Switch processing delay
TCP.advertisedWindow                         65,535              Receiver window of TCP
TCP.delayedAcksEnabled                       false               No delayed ACKs
TCP.minRexmitTimeout                         0.3                 Minimal retransmission timeout
TCP.mss                                      1452                TCP MSS
TCP.nagleEnabled                             true                TCP parameter
TCP.tcpAlgorithmClass                        DCTCPNewReno        DCTCP based on NewReno is used
TCPScatterGatterClientApp.idleInterval       exponential(200us)  Time between successive Shuffles (computation time)
TCPScatterGatterClientApp.reconnectInterval  1.00E-06            Time to set up the new connection
TCPScatterGatterClientApp.replyLength        2                   Resolver reply, just ACK
TCPScatterGatterClientApp.requestLength      65,536              Example data size of 64KiB from Mapper to Resolver

Table 1: Ethernet model (iNET) parameters and their values

The application used to generate the MapReduce
Shuffle stage is an application that runs Scatter and
then Gather from a list of nodes. To mimic the Shuffle,
the Scatter provides parallel send of the Mapper data
size and the Gather is of size 2 bytes only.

The tenants are placed on hosts numbered: 

Tenant 1: 7, 3, 25, 18, 0, 13, 24, 12 
Tenant 2: 8, 9, 20, 29, 6, 1, 28, 5 
Tenant 3: 11, 4, 16, 21, 2, 17, 22, 14 
Tenant 4: 19, 15, 10, 27, 31, 26, 23, 30 

D. INFINIBAND SIMULATIONS 

The InfiniBand simulation uses the model published
by Mellanox [71]. We have enhanced this model with
an application that relies on MPI semantics and is able 
to replay MPI traces. The parameters used for our sim¬ 
ulation are provided in Table 2. 

The tenants that are placed on the 1,728-node cluster 
are of the sizes: 

• Two tenants: the two are 810 and 834. 

• Eight tenants: all are 216 nodes. 

• Thirty two tenants: all are 54 nodes. 

The tenants execute cycles of computation and communication.
The computation time is of uniform distribution
in the range [5,15] µsec. So, for a Stencil application
exchanging 32KB of data on each dimension, the traffic
to computation ratio is obtained from: Calculation = 10 µsec and
Communication = $\frac{32\,\mathrm{KB}}{4\,\mathrm{GB/s}} = 8\,\mu\mathrm{sec}$. So the ratio of
ideal computation to communication is 10/8 for 32KB
exchanges. For all-to-all shuffles we increase the computation
time to be uniform in the range [20, 80] µsec, but
the data is sent to each other node in the tenant. So for
a 32KB exchange on an 8-node tenant, the ratio of computation
to ideal communication time is $\frac{50}{7 \cdot 8} = 50/56$.
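
These ratios can be approximately reproduced from the parameters in Table 2. The sketch below assumes that the link serializes data at the per-byte delay IBGenerator.genDlyPerByte = 2.5e-10 sec/B (that is, 4 GB/s), that 32KB means 32 x 1024 bytes, and that the mean of the uniform computation time is used; the function and constant names are hypothetical.

```python
GEN_DLY_PER_BYTE = 2.5e-10   # sec/B, IBGenerator.genDlyPerByte in Table 2
MSG_BYTES = 32 * 1024        # one 32KB exchange (assumed to be 32 * 1024 B)

def comm_time_usec(num_messages, msg_bytes=MSG_BYTES):
    """Ideal serialization time of `num_messages` messages on one link."""
    return num_messages * msg_bytes * GEN_DLY_PER_BYTE * 1e6

stencil_compute = (5 + 15) / 2      # mean of uniform [5, 15] usec
alltoall_compute = (20 + 80) / 2    # mean of uniform [20, 80] usec

# Stencil: one 32KB message per exchange, ratio close to 10/8.
print(stencil_compute / comm_time_usec(1))
# All-to-all on an 8-node tenant: 7 peers, ratio close to 50/56.
print(alltoall_compute / comm_time_usec(8 - 1))
```

The small gap to the rounded 10/8 and 50/56 figures comes from using 32,768 bytes rather than 32,000.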

Module.Parameter            Value        Description
--------------------------  -----------  ---------------------------------------------
IBGenerator.flit2FlitGap    0.001        A gap inserted between flits [nsec]
IBGenerator.flitSize        64           The flit size (IB credit is 64 bytes)
IBGenerator.genDlyPerByte   2.5e-10      Speed of generating bytes [sec/B]
IBGenerator.maxContPkts     10           Maximum number of continuous packets of same application
IBGenerator.pkt2PktGap      0.001        Gap inserted between packets [nsec]
IBGenerator.popDlyPerByte   2e-10        Speed of popping up data to next layer [sec/B]
IBInBuf.maxBeingSent        3            Switch speedup - number of parallel packets being drained from input buffer
IBInBuf.maxStatic0          800          Buffer size [credits]
IBInBuf.maxVL               0            Maximal VL simulated
IBInBuf.width               4            Link width is 4 lanes
IBOutBuf.credMinTime        0.256        Maximal time between credit updates [nsec]
IBOutBuf.maxVL              0            Maximal VL simulated
IBOutBuf.size               66           Host output buffer size [B]
IBOutBuf.size               78           Switch port output buffer size [B]
IBSink.flitSize             64           The flit size (IB credit is 64 bytes)
IBSink.hiccupDelay          1e+06        The receiver may hiccup for 1 usec
IBSink.hiccupDuration       0.0001       Length of a hiccup
IBSink.maxVL                0            Maximal VL simulated
IBSink.popDlyPerByte        2.5e-10      Speed of removing bytes to the PCIe
IBVLArb.busWidth            24           Input bus width of the switch arbiter
IBVLArb.coreFreq            250,000,000  Switch core frequency
cModule.ISWDelay            50           Intrinsic latency of the switch input buffer [nsec]
cModule.VSWDelay            50           Intrinsic latency of the switch arbiter [nsec]

Table 2: InfiniBand model parameters and their values